EMBL Outstation - The European Bioinformatics Institute
EMBL Nucleotide Sequence Database
Release Notes
Release 64 Sep 2000
EMBL Outstation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: +44-1223-494400
Telefax : +44-1223-494468
Electronic mail: datalib@ebi.ac.uk
URL: http://www.ebi.ac.uk
CONTENTS
* 1 RELEASE 64
o 1.1 Nine Billion Nucleotides
o 1.2 Draft Human Genome
o 1.2.1 Base Quality Values
o 1.2.2 ENSEMBL automatic annotation
o 1.3 Genomes Web Server
o 1.4 Cross-Reference Information
o 1.5 Database Files
o 1.5.1 EST Database Files
o 1.5.2 GSS Database Files
o 1.5.3 HUM Database Files
o 1.5.4 HTG Database Files
o 1.6 Sequence Retrieval System (SRS6)
o 1.7 EMBL Database FAQ
o 1.8 Disclaimer
* 2 FORTHCOMING CHANGES
o 2.1 Genome Representation
o 2.2 New HTC (High Throughput cDNA) division
o 2.3 EMBL Cumulative Update File
o 2.4 Splitting HTG and GSS division files
o 2.5 Next version of SRS indices
* 3 SEQUENCE SUBMISSION SYSTEMS
o 3.1 Checking Sequence Data For Vector Contamination
o 3.2 WebIn - WWW Sequence Submission System
o 3.3 Bulk Submissions
o 3.4 SEQUIN - Stand-alone Submission Program
o 3.5 Sequence Alignment Submissions
o 3.6 Further Submission Information
o 3.6.1 Annotation Guides
* 4 CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE
* 5 EBI NETWORK SERVICES
o 5.1 Electronic Mail Server
o 5.2 Anonymous FTP Server
o 5.3 World Wide Web (WWW) Server
o 5.4 Sequence Similarity Search Servers
* 6 DISTRIBUTION FILES
o 6.1 Release 64 Files
o 6.2 SRS Indices
* APPENDIX A DATABASE GROWTH TABLE
1 RELEASE 64
The EMBL Nucleotide Sequence Database was frozen to make Release 64 on
02-Sep-2000. The release contains 8,344,436 sequence entries comprising
9,650,223,037 nucleotides. This represents an increase of about 17% in
nucleotides over Release 63. A breakdown of Release 64 by division is shown
below:
Division Entries Nucleotides
----------------- ------------ ---------------
ESTs 5,565,880 2,194,418,599
Fungi 41,017 75,333,934
GSSs 1,717,212 950,099,606
HTG 77,671 4,263,600,014
Human 119,154 965,113,287
Invertebrates 54,900 329,846,226
Other Mammals 27,021 25,376,675
Organelles 72,962 61,665,029
Patents 207,677 67,411,887
Bacteriophage 1,595 4,385,850
Plants 68,956 221,131,770
Prokaryotes 86,977 218,928,626
Rodents 55,263 92,528,729
STSs 116,671 51,039,988
Synthetic 3,838 9,763,762
Unclassified 1,174 1,869,994
Viruses 102,523 90,011,114
Other Vertebrates 23,945 27,697,947
------------ ---------------
Total 8,344,436 9,650,223,037
1.1 Nine Billion Nucleotides
On 07-JUL-2000 the number of nucleotides in the EMBL Database passed the
9,000,000,000 mark. Over the last 12 months (compare Oct 1, 1999: 3.6 Gigabases)
the database size has increased by more than 160%.
EMBL database statistics are available at URL:
http://www3.ebi.ac.uk/Services/DBStats/
1.2 Draft Human Genome and HTG division
The completion of the human draft genome sequence was announced on
26-June-2000. The draft sequence data is available from the EMBL Database
HTG and HUM divisions.
The total size of the euchromatic portion of the genome is estimated to be 3.2
Gbases. The total (FIN + UNFIN) exceeds the size of the genome because of
redundancy; the general assumption is that about 30% - 40% of the bases are
redundant.
Below are the database statistics for finished and unfinished human sequence
in the EMBL database as of September 19, 2000.
DATE FIN_TOTAL UNFIN_TOTAL FIN + UNFIN
------ --------- ----------- -----------
9/2000 910 Mb 3505 Mb 4415 Mb
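As a rough worked example of the redundancy estimate, the sketch below applies
the assumed 30% - 40% redundancy to the FIN + UNFIN total from the table above.
The function name is illustrative only and is not part of any EBI tool.

```python
def nonredundant_mb(total_mb, redundancy):
    """Estimate the non-redundant size for an assumed redundant fraction."""
    return total_mb * (1.0 - redundancy)

# Figures from the table above (September 19, 2000).
total_mb = 910 + 3505  # FIN_TOTAL + UNFIN_TOTAL = 4415 Mb

low = nonredundant_mb(total_mb, 0.40)   # assuming 40% of bases are redundant
high = nonredundant_mb(total_mb, 0.30)  # assuming 30% of bases are redundant

print(f"Estimated non-redundant sequence: {low:.1f} to {high:.1f} Mb")
print("Estimated euchromatic genome size: 3200 Mb")
```

Under these assumptions the non-redundant estimate falls below the 3.2 Gbases
genome estimate, which is consistent with the sequence being a draft.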
See also the Genome Monitoring Table for further detailed information
available from the EBI at URL
http://www.ebi.ac.uk/Databases/Genome_MOT/genome_mot.html
1.2.1 Base quality values
Quality scores from draft HTG data are available on the EBI FTP server. The
gzip'ed files in the directory contain base quality values for unfinished human
sequences from Japanese, US and European sequencing centres. The FastA-type
headers contain the EMBL accession number/version of the corresponding database
entries.
Example:
>AL009030.9 Phrap Quality (Length:229022, Min: 3, Max: 99)
In order to keep the size of the files within reasonable limits for handling
purposes, files that are bigger than 1 Gb in uncompressed form are split
into smaller files.
Directory: ftp://ftp.ebi.ac.uk/pub/databases/embl/quality_scores
Current Files: /htg_sanger1.qscore.gz - /htg_sanger3.qscore.gz
/htg_genoscope1.qscore.gz
/htg_mpimg1.qscore.gz
/htg_gbf1.qscore.gz
/htg_japan1.qscore.gz
/htg_us1.qscore.gz - /htg_us9.qscore.gz
Quality score files are updated on a daily basis.
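The FastA-type header shown in the example above can be read with a short
parser. This is a minimal sketch that assumes every header follows exactly the
layout of that single example; the regular expression and the function name are
illustrative and are not an official format specification.

```python
import re

# Pattern modelled on the single example header given above (an assumption).
HEADER_RE = re.compile(
    r"^>(?P<accession>[A-Z]+\d+)\.(?P<version>\d+)\s+(?P<label>.*?)\s*"
    r"\(Length:\s*(?P<length>\d+),\s*Min:\s*(?P<min>\d+),\s*Max:\s*(?P<max>\d+)\)\s*$"
)

def parse_quality_header(line):
    """Split a quality-score header into accession, version and statistics."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError("unrecognised quality header: %r" % line)
    return {
        "accession": match.group("accession"),
        "version": int(match.group("version")),
        "label": match.group("label"),
        "length": int(match.group("length")),
        "min": int(match.group("min")),
        "max": int(match.group("max")),
    }

header = ">AL009030.9 Phrap Quality (Length:229022, Min: 3, Max: 99)"
print(parse_quality_header(header))
```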
1.2.2 Ensembl automatic annotation
Ensembl provides automatic annotation to the human draft genome data including
information on confirmed peptides, confirmed cDNAs and also predicted peptides.
Additionally, repeat prediction along with integration of map information and
SNPs are available.
Updated human genome resources spanning the entire working draft are now
available. Ensembl has released its automatic annotation for a June 15th
"frozen" data set at http://freeze.ensembl.org. This URL will now be the stable
location for all subsequent "frozen" dataset updates.
The Ensembl web site is available at URL http://www.ensembl.org/
Ensembl is a joint project between the Sanger Centre and EMBL-EBI.
1.3 Genomes Web Server
Access to completed genomes
The first completed genomes from viruses, phages and organelles were deposited
into the EMBL Database in the early 1980's. Since then, molecular biology's
shift to obtain the complete sequences of as many genomes as possible combined
with major developments in sequencing technology resulted in hundreds of
complete genome sequences being added to the database, including Archaea,
Eubacteria and Eukaryota. Recent additions include Buchnera sp. APS
(acc# BA000003) and Pseudomonas aeruginosa (acc# AE004091).
EBI's Genome Web Server provides easy access to completed genome sequences and
is available at URL: http://www.ebi.ac.uk/genomes/
Genome Monitoring Table
The Genome MOT presents the status of a number of large eukaryotic genome
sequencing projects. The tables are updated daily and also provide access to
EMBL database entries. The Genome MOT is available at URL:
http://www.ebi.ac.uk/Databases/Genome_MOT/genome_mot.html
1.4 Cross-Reference Information
Links to a growing list of external databases have been expanded allowing
integration with specialised data collections, such as protein databases,
species-specific databases, taxonomy databases etc. The WWW-based sequence
retrieval system (SRS) enables users to easily navigate between cross-referenced
database entries.
EMBL links to other databases:
Database Nr of links
---------- -----------
RZPD 2002574
TrEMBL 338688
Demeter 175252
SWISS-PROT 143124
MaizeDB 65929
FLYBASE 40968
IMGT/LIGM 37286
MENDEL 21033
GDB 8430
MGD 7998
TRANSFAC 6620
SGD 6029
EPD 3094
IMGT/HLA 2628
----------------------
Total 2859653
A list of URLs which conform with current DR line references is available:
Demeter http://ars-genome.cornell.edu
EPD http://www.epd.isb-sib.ch
FLYBASE http://www.fruitfly.org
GDB http://www.gdb.org
IMGT/HLA http://www.ebi.ac.uk/imgt/hla
IMGT/LIGM http://imgt.cines.fr:8104
MGD http://www.informatics.jax.org
MaizeDB http://www.agron.missouri.edu
MENDEL http://mbclserver.rutgers.edu/CPGN
RZPD http://www.rzpd.de
SGD http://genome-www.stanford.edu
SWISS-PROT http://www.expasy.ch
TRANSFAC http://transfac.gbf.de/TRANSFAC
TrEMBL http://www.ebi.ac.uk/swissprot/Information/information.html
1.5 Database Files
In order to keep the size of the data files within reasonable limits for
handling purposes, additional division files will be added in subsequent
releases as appropriate.
1.5.1 EST Database Files
EST files are now split according to taxonomic subdivisions following the model
of the taxonomic split of all other EMBL database divisions, e.g. Release 64
includes files
est_fun.dat Fungi ESTs
est_hum1.dat - est_hum23.dat Human ESTs
est_inv1.dat - est_inv4.dat Invertebrate ESTs
est_mam1.dat - est_mam2.dat Mammal ESTs
est_pln1.dat - est_pln8.dat Plant ESTs
est_pro.dat Prokaryote ESTs
est_rod1.dat - est_rod19.dat Rodent ESTs
est_vrt1.dat - est_vrt2.dat Vertebrate ESTs
This should reduce significantly the volume of data users have to parse in
order to extract ESTs for specific groups of organisms.
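To pick only the files for one group of organisms, the Release 64 EST file
names listed above can be generated as follows. This is a sketch; the
dictionary and function names are illustrative, and the file counts apply to
Release 64 only.

```python
# Number of files per taxonomic subdivision in Release 64, taken from the
# listing above. None marks a single, unnumbered file.
EST_FILE_COUNTS = {
    "fun": None,
    "hum": 23,
    "inv": 4,
    "mam": 2,
    "pln": 8,
    "pro": None,
    "rod": 19,
    "vrt": 2,
}

def est_files(subdivision):
    """Return the Release 64 EST file names for one taxonomic subdivision."""
    count = EST_FILE_COUNTS[subdivision]
    if count is None:
        return ["est_%s.dat" % subdivision]
    return ["est_%s%d.dat" % (subdivision, i) for i in range(1, count + 1)]

print(est_files("mam"))
print(est_files("pro"))
```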
1.5.2 GSS Database Files
The GSS division has been split into 18 files (gss1.dat-gss18.dat).
1.5.3 HUM Database Files
The HUM division has been split into 6 files (hum1.dat-hum6.dat).
1.5.4 HTG Database Files
The HTG division has been split into 11 files (htgo.dat and htg1.dat-htg10.dat).
htgo.dat includes all HTGS_PHASE0 entries. These typically consist of one-to-few
pass reads of a single clone, have not been assembled into contigs and are
unoriented, unordered, unannotated and contain gaps with runs of 'N's separating
the reads. Low-pass sequence sampling is useful for identifying clones that may
be gene-rich. Phase0 sequences are used to check whether another center is
already sequencing this clone. If not, it will be sequenced through phase 1 and
phase 2. When records are updated, the accession numbers will be preserved.
Files htg1-htg10 include all other HTG entries (HTGS_PHASE1 - HTGS_PHASE2).
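The phase of each HTG entry can be tallied with a short script. This is a
sketch only: it assumes that the phase appears as a keyword on the KW lines of
an EMBL flat-file entry and that each entry ends with a "//" line, and the
sample entries below are synthetic and heavily abbreviated.

```python
PHASES = ("HTGS_PHASE0", "HTGS_PHASE1", "HTGS_PHASE2")

def htg_phase_counts(lines):
    """Count entries per HTGS phase keyword in EMBL flat-file lines."""
    counts = {}
    keywords = []
    for line in lines:
        if line.startswith("KW"):
            # Collect keyword text; KW lines may continue over several lines.
            keywords.append(line[2:].strip())
        elif line.startswith("//"):
            # End of entry: classify it by the first phase keyword found.
            text = " ".join(keywords)
            phase = next((p for p in PHASES if p in text), "UNKNOWN")
            counts[phase] = counts.get(phase, 0) + 1
            keywords = []
    return counts

# Synthetic, abbreviated entries for illustration only.
sample = [
    "ID   EXAMPLE1",
    "KW   HTG; HTGS_PHASE0.",
    "//",
    "ID   EXAMPLE2",
    "KW   HTG; HTGS_PHASE1.",
    "//",
]
print(htg_phase_counts(sample))
```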
1.6 Sequence Retrieval System (SRS6)
As announced earlier, EBI's SRS6 server is available. The URL
http://srs.ebi.ac.uk/ now maps to http://srs6.ebi.ac.uk/.
All external services are available from the Tools button on EBI's Web pages.
If you have any comments and/or suggestions please send these to:
support@ebi.ac.uk
1.7 EMBL Database FAQ
An EMBL Database FAQ has been created and is available from the EBI at URL
http://www.ebi.ac.uk/embl/Documentation/FAQ/
This document includes information on:
General questions about EMBL and other databases
Submission procedure
Updating database entries
WEBIN-specific questions
Navigation guide
1.8 Disclaimer
No guarantee is given as to the completeness and accuracy of the database
entries, in particular the conformity of sequence data in the database with
the journal publication where the sequence is also disclosed.
2 FORTHCOMING CHANGES
2.1 Genome Representation
At the May 2000 Collaborative Meeting the sequence database collaboration
DDBJ/EMBL/GenBank confirmed that the currently existing experimental FTP
directory representing genome data will be transformed into a database division
CON (Constructed Sequences) to represent complete genomes and other long
sequences constructed from segment entries. The CON division entries will
contain the construct information (accession numbers and sequence locations)
involved in building the genomes. CON entries and the corresponding information
will be included in the daily data exchange mechanism between the
collaborating databases.
The CON entry file includes construct information and all accession numbers
relevant to the genome. Additionally, the complete entry in EMBL format
(DNA and features) and the complete DNA sequence in Fasta format are provided.
These entries will be linked, searchable and retrievable through SRS and
available for BLAST and FASTA homology searching.
For an example representation, see the bacterial genome of Pseudomonas
aeruginosa (AE004091) in
ftp://ftp.ebi.ac.uk/pub/databases/embl/genomes/Bacteria/paeruginosa/
AE004091.con
AE004091.embl
AE004091.embl.Z
AE004091.fasta
AE004091.fasta.Z
2.2 New HTC (High Throughput cDNA) division
At the May 2000 collaborative meeting DDBJ/EMBL/GenBank agreed to create a new
database division HTC to represent unfinished High Throughput cDNA sequences.
HTC sequences may include 5'UTR and 3'UTR regions and (part of a) coding
region. Upon finishing of these sequences, they will be moved to the
corresponding taxonomic division. HTC sequence entries will include the keyword
'HTC'. The keyword will be removed once the entry has been included in the
taxonomic division.
2.3 EMBL cumulative update file
We intend to discontinue the provision of the single cumulative update file.
Several sites have reported problems handling our EMBL cumulative update file
when it grows beyond 2 Gb (uncompressed), because their file systems do not
support files larger than 2 Gb. Instead of the cumulative.dat.gz file, we will
continue to make available on our FTP server a set of smaller data files named
cum_*.dat.gz, which together contain the same data as the full cumulative
update file.
For further details please check the README file in directory
ftp://ftp.ebi.ac.uk/pub/databases/embl/new/
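Sites that previously processed the single cumulative file can read the smaller
files as one stream without ever creating a file larger than 2 Gb. The sketch
below assumes the cum_*.dat.gz files have been downloaded to a local directory
and that reading them in sorted name order is acceptable; the README remains
the authority on the actual file set. The function name is illustrative.

```python
import glob
import gzip
import os

def iter_cumulative_lines(directory):
    """Yield the lines of all cum_*.dat.gz files in a directory, in name order."""
    pattern = os.path.join(directory, "cum_*.dat.gz")
    for path in sorted(glob.glob(pattern)):
        # Decompress on the fly so no large uncompressed file is written.
        with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
            for line in handle:
                yield line
```

A caller can then count entries, for example, with
sum(1 for line in iter_cumulative_lines(".") if line.startswith("//")).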
2.4 Splitting HTG and GSS division files
We plan to split HTG and GSS division files according to taxonomic subdivisions
following the model of the taxonomic split of all other EMBL database divisions.
This should reduce significantly the volume of data users have to parse in order
to extract HTGs and GSSs for specific groups of organisms. Files will be named
accordingly e.g.
HTGS_PHASE0 sequences will be included in files htgo_hum.dat, htgo_inv.dat
htgo_rod.dat etc, while htgo.dat will include all remaining HTGS_PHASE0 entries.
HTGS_PHASE1 - HTGS_PHASE2 sequences will be included in files htg_hum.dat,
htg_inv.dat, htg_rod.dat etc while htg.dat will include all remaining HTG
entries.
GSS sequences will be included in files gss_fun.dat, gss_hum.dat etc, while
gss.dat will include all remaining GSS entries.
2.5 Next version of SRS indices
Please note that the next version of SRS indices will be for version 607x and not 606.
3 SEQUENCE SUBMISSION SYSTEMS
3.1 Checking Sequence Data For Vector Contamination
We urge submitters to remove vector contamination from sequence data before
submitting to the database. To assist submitters the EBI is providing a Vector
Screening Service using the latest implementation of the BLAST algorithm and a
special sequence databank known as EMVEC. EMVEC is an extraction of sequences
from the SYNthetic division of EMBL containing more than 2000 sequences
commonly used in cloning and sequencing experiments. EMVEC is by no means a
complete vector databank but EBI believes it is representative of the kind of
material used in modern sequencing and should be useful to submitters. The
databank will be updated with each release of EMBL and made publicly available
on the EBI's ftp server for those who wish to have it.
The interactive WWW service can be found at:
http://www.ebi.ac.uk/embl/Submission/webin.html
http://www.ebi.ac.uk/blastall/vectors.html
The results will list sequences producing significant alignments and associated
information like vector name, score, alignment etc.
3.2 WebIn - WWW Sequence Submission System
WebIn is the preferred WWW Sequence Submission System for submitting nucleotide
sequence data and associated biological information to the EMBL Nucleotide
Sequence Database at the European Bioinformatics Institute (EBI). To access WebIn
at the EBI please use the following URL:
http://www.ebi.ac.uk/embl/Submission/webin.html
Database entries submitted to the EMBL Nucleotide Sequence Database at the EBI
will be exchanged and shared among the International Collaboration of Nucleotide
Sequence Databases (DDBJ/EMBL/GenBank).
WebIn guides the user through a sequence of WWW forms allowing the submission
of sequence data and descriptive information in an interactive and easy way.
All the information required to create a database entry will be collected
during this process:
1 Submitter Information
2 Release Date Information
3 Sequence Data, Description and Source Information
4 Reference Citation Information
5 Feature Information (e.g. coding regions, regulators,
signals etc.)
EBI staff will process data submissions within 2 working days and send the
database accession number(s) assigned to your data to your e-mail address.
3.3 Bulk Submissions
To make bulk sequence submission less time consuming for submitters, a new
web-based submission system can now be accessed from the WebIn page. Authors
planning to submit a large number of similar sequences (i.e. more than 25) are
presented with an option for "Bulk WebIn Submission". When choosing the bulk
path, submitters carry on the usual WebIn submission procedure until they have
finished a first, single representative sequence. During the
submission process database staff will interactively assist in making the
submission of this specific data as convenient as possible, thus saving the
author the time and effort required to complete numerous submission events
individually.
Alternatively, authors planning to submit very large numbers of similar
sequences should contact the database before submitting the data. Database
staff will create series of templates and communicate these to the author for
completion with just the information unique to each sequence required.
Please contact database staff if you require further information.
e-mail: datasubs@ebi.ac.uk
Tel: +44-1223-494499
Fax: +44-1223-494472
3.4 SEQUIN - Stand-alone Submission Program
Sequin is the multi-platform (Mac/PC/Unix) stand-alone software tool developed
by the NCBI for submitting entries to the EMBL, GenBank, or DDBJ sequence
databases. The Sequin program, along with detailed downloading and installation
instructions plus general information are available from the EBI via WWW and
anonymous FTP.
http://www3.ebi.ac.uk/Services/Sequin/
ftp://ftp.ebi.ac.uk/pub/software/sequin/
3.5 Sequence Alignment Submissions
The EBI accepts submissions of alignment data (e.g. from phylogenetic and
population analyses) of either nucleotide or amino-acid sequences. Database
staff assign an alignment number (e.g. ds38200), which is then communicated to
the submitter. We suggest that this number be quoted in the resulting
publication.
Alignment data and associated information are made available via EBI's network
servers (see below).
ALIGNMENT FORMATS:
As well as your alignment data we require information describing your alignment
(see table below). Please provide information for all fields.
Description Field Information required
TITLE: Title of alignment
SUBMITTER: Name, Affiliation, Phone, Fax, Email
RELEASE DATE: Public Immediately / if Confidential please
provide hold date
CITATION: If known please provide complete Author list,
Title, Journal, Year of publication, Page
numbers
ALIGNMENT METHOD: Method of alignment and format submitted,
parameters of alignment sequences used (if
appropriate)
DESCRIPTION OF e.g. Gaps indicated by a dash '-'
SYMBOLS:
DESCRIPTION OF Describe sequences aligned, including accession
ALIGNMENT: numbers (if known) and abbreviation of clones or
taxon used in alignment file. If your alignment
contains sequences derived from multiple
taxonomic sources, please provide the full name
of each organism
FILE FORMAT:
We suggest submission in STANDARD ALIGNMENT FORMATS (e.g. NEXUS, PHYLIP,
CLUSTALW) or Sequin output.
A sample alignment in NEXUS format can be viewed at
ftp://ftp.ebi.ac.uk/pub/databases/embl/align/ds32096.dat
NOTE 1: Alignments can be created within Sequin or imported into Sequin from
files in a standard alignment format like NEXUS or PHYLIP.
NOTE 2: If reporting new primary sequence data, we suggest that you submit
the complete individual sequence files (e.g. via Sequin or WebIn), in order to
include the sequence data as individual entries in the EMBL database. If gaps
have been introduced for the alignment, please leave them out when sending the
individual sequence files.
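Removing alignment gaps before sending individual sequences, as requested in
NOTE 2, can be done with a one-line helper. This is a minimal sketch that
assumes gaps are written as a dash '-'; other gap symbols can be passed in, and
the function name is illustrative.

```python
def strip_gaps(aligned_sequence, gap_chars="-"):
    """Return the sequence with all alignment gap characters removed."""
    return "".join(ch for ch in aligned_sequence if ch not in gap_chars)

print(strip_gaps("ACG--TTA-G"))
```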
SENDING ALIGNMENT DATA to the EMBL Nucleotide Sequence Database
Sequence alignment data can be sent to the Nucleotide Sequence Database by
Electronic mail to datasubs@ebi.ac.uk
ACCESSING ALIGNMENT DATA
Alignment data and additional information are available via the EBI servers:
EBI WWW server:
http://www.ebi.ac.uk/embl/Submission/alignment.html
ftp://ftp.ebi.ac.uk/pub/databases/embl/align/
EBI FTP server: by anonymous FTP from FTP.EBI.AC.UK in directory
pub/databases/embl/align
EBI File server: by sending an e-mail message to netserv@ebi.ac.uk
including the line HELP ALIGN or GET ALIGN:DS8200.DAT
3.6 Further Submission Information
3.6.1 Annotation Guides
To help and guide submitters in annotating their sequences, two online guides
are available via hyperlinks from within WebIn:
EMBL Annotation Examples (http://www3.ebi.ac.uk/Services/Standards/web/) and
EMBL Features and Qualifiers (http://www3.ebi.ac.uk/Services/WebFeat/). The
annotation examples consist of a list of EMBL approved feature table
annotations for common biological sequences. The EMBL Features and Qualifiers
is a complete list of feature table key and qualifier definitions providing
detailed descriptions, mandatory and optional qualifiers and usage examples.
For further information on submission of sequence data to the EMBL Nucleotide
Sequence Database please access:
http://www.ebi.ac.uk/embl/Submission/
or contact database staff at:
EMBL Nucleotide Sequence Submissions
e-mail: datasubs@ebi.ac.uk
telephone: +44-1223-494499
telefax: +44-1223-494472
4 CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE
We encourage authors to include a reference to the EMBL Database in
publications related to their research.
When citing data in the EMBL Database, we suggest giving the corresponding
primary accession number and the publication in which the sequence first
appeared. For unpublished data, we suggest contacting the original submitters
for recent publication information or revisions of the data.
We also suggest providing a reference for the EMBL Database itself. Our
recent publication describing the EMBL database should be cited:
Baker W., van den Broek A., Camon E., Hingamp P., Sterk P., Stoesser G.,
and Tuli M.A. 'The EMBL Nucleotide Sequence Database',
Nucl. Acids Res., 28 (1), 19-23 (2000).
Example: The numbers in parentheses refer to the REFERENCE in the EMBL
database entry, and to the EMBL citation above.
"Sequence entry X56734 (1) has been retrieved from the EMBL Database (2)
and showed significant sequence similarity to ..."
(1) Oxtoby, E., et al., Plant Mol. Biol. 17:209-219(1991).
(2) Baker, W., et al., Nucl. Acids Res. 28:19-23(2000)
5 EBI NETWORK SERVICES
5.1 Electronic Mail Server
Computer users with access to the Internet (directly or via a gateway) can obtain
copies of database entries, documentation or the data submission form, by
sending commands to a file server running at EBI. New and updated EMBL
nucleotide sequence entries are made available on the server on a daily basis.
To use this facility, send file server commands (as electronic mail) to the
address netserv@ebi.ac.uk. Each line of the mail message should consist of a
single file server request.
The most important file server request, to get started, is:
HELP
If the file server receives this command, it will return a helpfile to the
sender, explaining in some detail how to use the facility. For example, to
request a copy of the nucleotide sequence with accession number X55652, use
the command:
GET NUC:X55652
The file server offers various other services (e.g. access to nucleotide and
protein sequence data, protein structure data, software), details of which are
provided in the HELP file.
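A request message with one file server command per line can be composed as
follows. This sketch only builds the message; the sender address is a
placeholder, the function name is illustrative, and actually sending the
message (for example with smtplib and a local mail relay) is left to the user.

```python
from email.message import EmailMessage

NETSERV_ADDRESS = "netserv@ebi.ac.uk"

def build_netserv_request(sender, commands):
    """Build a mail message with one file server command per line."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = NETSERV_ADDRESS
    message.set_content("\n".join(commands) + "\n")
    return message

# Placeholder sender address; replace with your own.
request = build_netserv_request("user@example.org", ["HELP", "GET NUC:X55652"])
print(request.get_content())
```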
5.2 Anonymous FTP Server
An alternative method of accessing the EBI archives is to use the Internet
File transfer protocol (ftp). Researchers with direct access to the Internet
can use the FTP program on their local machine to connect to the host
FTP.EBI.AC.UK and enter the username "anonymous" and their email address as
password.
The directory pub/help contains detailed information about the data available
from the EBI anonymous FTP server which includes the complete EMBL Nucleotide
Sequence Database releases as well as daily and weekly updates and a cumulative
update file (in UNIX-compressed format) in the following directories:
EMBL quarterly release: pub/databases/embl/release
EMBL updates: pub/databases/embl/new
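The anonymous login described above can also be scripted. The following is a
sketch using Python's standard ftplib module; the function name is
illustrative, the e-mail address used as password is supplied by the caller,
and the call requires network access to the server.

```python
from ftplib import FTP

HOST = "ftp.ebi.ac.uk"
RELEASE_DIR = "pub/databases/embl/release"
UPDATES_DIR = "pub/databases/embl/new"

def list_directory(directory, email, host=HOST):
    """Log in anonymously and return the file names in one directory."""
    with FTP(host) as ftp:
        # Username "anonymous", e-mail address as password, as described above.
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(directory)
        return ftp.nlst()
```

For example, list_directory(UPDATES_DIR, "user@example.org") would return the
names of the current update files, where the address is a placeholder.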
5.3 World Wide Web (WWW) Server
The EBI operates a WWW server with URL http://www.ebi.ac.uk/ which gives access
to information about the EBI and its products and services. Nucleotide
sequences can be retrieved by a simple query by accession number, or more
complex queries can be constructed using an SRS WWW databank browser. Nucleotide
sequences can also be submitted to the database using the interactive submission
system WebIn at URL:
http://www.ebi.ac.uk/embl/Submission/webin.html
5.4 Sequence Similarity Search Servers
The EBI offers the following network servers for sequence similarity searches via
electronic mail or interactive WWW forms:
FASTA based on W. Pearson's FASTA algorithm. Allows local similarity
searches of protein and nucleotide sequence databases.
Send "help" to Fasta@EBI.AC.UK or use
URL http://www.ebi.ac.uk/fasta3/
BLAST based on the NCBI and WU-Blast software. Send "help" to
Blast@EBI.AC.UK or use URL http://www.ebi.ac.uk/blast2/
BLITZ allows very fast searches of protein sequence databases for
local similarities using an exhaustive Smith-Waterman matching
algorithm. Compugen's BIC_SW software is running on a
Biocellerator (BIC-2). Send "help" to Blitz@EBI.AC.UK or
use URL http://www.ebi.ac.uk/bic_sw/
6 DISTRIBUTION FILES
6.1 Release 64 Files
The release contains the files shown below, in the order listed. File sizes are
given as numbers of records.
File Number File Name Description Number of Records
1 DELETEAC.TXT Deleted accession numbers 44649
2 FTABLE.TXT Feature Table Documentation 465
3 RELNOTES.TXT Release Notes (this document) 915
4 SUBFORM.TXT Data Submission Form 418
5 SUBINFO.TXT Data Submission Documentation 333
6 UPDATE.TXT Data Update Form 107
7 USRMAN.TXT User Manual 1469
8 ACNUMBER.NDX Accession Number Index 8372365
9 CITATION.NDX Citation Index 1872434
10 DIVISION.NDX Division Index 23
11 KEYWORD.NDX Keyword Index 3109242
12 SHORTDIR.NDX Short Directory Index 21428207
13 SPECIES.NDX Species Index 2888410
14 EST_FUN.DAT EST Sequences 3491596
15 EST_HUM1.DAT EST Sequences 7242162
16 EST_HUM2.DAT EST Sequences 7383411
17 EST_HUM3.DAT EST Sequences 7092087
18 EST_HUM4.DAT EST Sequences 6958043
19 EST_HUM5.DAT EST Sequences 7086795
20 EST_HUM6.DAT EST Sequences 7098043
21 EST_HUM7.DAT EST Sequences 7136249
22 EST_HUM8.DAT EST Sequences 7031857
23 EST_HUM9.DAT EST Sequences 7156374
24 EST_HUM10.DAT EST Sequences 6859020
25 EST_HUM11.DAT EST Sequences 6661083
26 EST_HUM12.DAT EST Sequences 6431484
27 EST_HUM13.DAT EST Sequences 6811351
28 EST_HUM14.DAT EST Sequences 6856402
29 EST_HUM15.DAT EST Sequences 7036586
30 EST_HUM16.DAT EST Sequences 7306475
31 EST_HUM17.DAT EST Sequences 7263236
32 EST_HUM18.DAT EST Sequences 7357458
33 EST_HUM19.DAT EST Sequences 7444208
34 EST_HUM20.DAT EST Sequences 7476190
35 EST_HUM21.DAT EST Sequences 6699624
36 EST_HUM22.DAT EST Sequences 6963358
37 EST_HUM23.DAT EST Sequences 4588499
38 EST_INV1.DAT EST Sequences 6431773
39 EST_INV2.DAT EST Sequences 6042873
40 EST_INV3.DAT EST Sequences 6293598
41 EST_INV4.DAT EST Sequences 4046341
42 EST_MAM1.DAT EST Sequences 6114230
43 EST_MAM2.DAT EST Sequences 2356039
44 EST_PLN1.DAT EST Sequences 6750911
45 EST_PLN2.DAT EST Sequences 6219344
46 EST_PLN3.DAT EST Sequences 5830564
47 EST_PLN4.DAT EST Sequences 7215994
48 EST_PLN5.DAT EST Sequences 7046836
49 EST_PLN6.DAT EST Sequences 6762278
50 EST_PLN7.DAT EST Sequences 6720107
51 EST_PLN8.DAT EST Sequences 6029205
52 EST_PRO.DAT EST Sequences 38548
53 EST_ROD1.DAT EST Sequences 7331559
54 EST_ROD2.DAT EST Sequences 7567611
55 EST_ROD3.DAT EST Sequences 7220551
56 EST_ROD4.DAT EST Sequences 7549688
57 EST_ROD5.DAT EST Sequences 6811012
58 EST_ROD6.DAT EST Sequences 7086810
59 EST_ROD7.DAT EST Sequences 9771985
60 EST_ROD8.DAT EST Sequences 9130283
61 EST_ROD9.DAT EST Sequences 7665029
62 EST_ROD10.DAT EST Sequences 9177208
63 EST_ROD11.DAT EST Sequences 9743196
64 EST_ROD12.DAT EST Sequences 9700691
65 EST_ROD13.DAT EST Sequences 9653685
66 EST_ROD14.DAT EST Sequences 9473210
67 EST_ROD15.DAT EST Sequences 9015774
68 EST_ROD16.DAT EST Sequences 6666497
69 EST_ROD17.DAT EST Sequences 7649778
70 EST_ROD18.DAT EST Sequences 7420422
71 EST_ROD19.DAT EST Sequences 738690
72 EST_VRT1.DAT EST Sequences 7641169
73 EST_VRT2.DAT EST Sequences 2254064
74 FUN.DAT Fungi Sequences 3736027
75 GSS1.DAT Genome Survey Sequences 6116578
76 GSS2.DAT Genome Survey Sequences 6118824
77 GSS3.DAT Genome Survey Sequences 6268149
78 GSS4.DAT Genome Survey Sequences 6628318
79 GSS5.DAT Genome Survey Sequences 6554451
80 GSS6.DAT Genome Survey Sequences 6616068
81 GSS7.DAT Genome Survey Sequences 6639716
82 GSS8.DAT Genome Survey Sequences 6644800
83 GSS9.DAT Genome Survey Sequences 6958158
84 GSS10.DAT Genome Survey Sequences 6788195
85 GSS11.DAT Genome Survey Sequences 7155659
86 GSS12.DAT Genome Survey Sequences 6988978
87 GSS13.DAT Genome Survey Sequences 6978243
88 GSS14.DAT Genome Survey Sequences 6402203
89 GSS15.DAT Genome Survey Sequences 6646868
90 GSS16.DAT Genome Survey Sequences 7448747
91 GSS17.DAT Genome Survey Sequences 6669805
92 GSS18.DAT Genome Survey Sequences 1027489
93 HTG1.DAT High Throughput Genome Sequences 7854248
94 HTG2.DAT High Throughput Genome Sequences 5995734
95 HTG3.DAT High Throughput Genome Sequences 4210260
96 HTG4.DAT High Throughput Genome Sequences 4724917
97 HTG5.DAT High Throughput Genome Sequences 8718298
98 HTG6.DAT High Throughput Genome Sequences 8721834
99 HTG7.DAT High Throughput Genome Sequences 8979368
100 HTG8.DAT High Throughput Genome Sequences 8137472
101 HTG9.DAT High Throughput Genome Sequences 7846179
102 HTG10.DAT High Throughput Genome Sequences 4273070
103 HTGO.DAT High Throughput Genome Sequences 8701440
104 HUM1.DAT Human Sequences 9494007
105 HUM2.DAT Human Sequences 5320579
106 HUM3.DAT Human Sequences 3561983
107 HUM4.DAT Human Sequences 2858503
108 HUM5.DAT Human Sequences 2298449
109 HUM6.DAT Human Sequences 1644433
110 INV.DAT Invertebrate Sequences 9495348
111 MAM.DAT Other Mammal Sequences 1908267
112 ORG.DAT Organelle Sequences 5140625
113 PATENT.DAT Patent Sequences 8110279
114 PHG.DAT Bacteriophage Sequences 217840
115 PLN.DAT Plant Sequences 8269953
116 PRO1.DAT Prokaryote Sequences 6104496
117 PRO2.DAT Prokaryote Sequences 4233076
118 ROD.DAT Rodent Sequences 4755562
119 STS.DAT STS Sequences 7970081
120 SYN.DAT Synthetic Sequences 394629
121 UNC.DAT Unclassified Sequences 106371
122 VRL.DAT Viral Sequences 7545287
123 VRT.DAT Other Vertebrate Sequences 1787491
6.2 SRS Indices
SRS indices can be found on the FTP server in the srs directory
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/srs/.
See README file for details.
Please note that the next version of SRS indices will be for version 607x
and not 606.
APPENDIX A
DATABASE GROWTH TABLE
The following table shows the growth of the EMBL Nucleotide Sequence
Database at each release.
Release Month Entries Nucleotides
1 06/1982 568 585433
2 04/1983 811 1114447
3 12/1983 1481 1654863
4 08/1984 1698 2147205
5 04/1985 2378 2874493
6 08/1985 4835 4567592
7 12/1985 5789 5622638
8 04/1986 6395 6353040
9 09/1986 7630 7813214
10 12/1986 8817 9766948
11 04/1987 11621 12189783
12 07/1987 12706 13638061
13 10/1987 14397 16023478
14 01/1988 15344 17272160
15 05/1988 17961 20318442
16 08/1988 19592 22625941
17 11/1988 20695 24211054
18 02/1989 22938 27249830
19 05/1989 24365 29066676
20 08/1989 26223 31240948
21 11/1989 28679 34748087
22 02/1990 31508 38165786
23 05/1990 34902 42923803
24 08/1990 37784 47354438
25 11/1990 41580 52900354
26 02/1991 43745 55859549
27 05/1991 46871 59915244
28 09/1991 54558 70448052
29 12/1991 57655 75400487
30 03/1992 63378 83574342
31 06/1992 72481 94390065
32 09/1992 79377 101292310
33 12/1992 89100 111413979
34 03/1993 99591 121420828
35 06/1993 108973 131880111
36 09/1993 127933 145401156
37 12/1993 146576 158171400
38 03/1994 167777 177550115
39 06/1994 182615 192195819
40 09/1994 209352 211017104
41 12/1994 230950 226259607
42 03/1995 303206 262559786
43 06/1995 420111 315840053
44 09/1995 506190 363273777
45 12/1995 622566 427620278
46 03/1996 701246 473691480
47 06/1996 827174 550739395
48 09/1996 928067 608931850
49 12/1996 1047263 696183789
50 03/1997 1187455 789755858
51 06/1997 1432941 931351601
52 10/1997 1787004 1181167498
53 12/1997 1917868 1281391651
54 03/1998 2125225 1427634373
55 06/1998 2330040 1607673907
56 09/1998 2689618 1904091473
57 12/1998 3046471 2164718256
58 03/1999 3272064 2355200790
59 06/1999 3952878 2924568545
60 09/1999 4719266 3543553093
61 12/1999 5303436 4508169737
62 03/2000 5865742 6120908677
63 06/2000 6760113 8255674441
64 09/2000 8344436 9650223037
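The growth between two releases can be derived directly from the table. The
sketch below uses the Release 63 and Release 64 rows above; the function name
is illustrative.

```python
def percent_growth(old, new):
    """Return the percentage increase from old to new."""
    return (new - old) / old * 100.0

# (entries, nucleotides) taken from the growth table above.
releases = {
    63: (6760113, 8255674441),
    64: (8344436, 9650223037),
}

entries_growth = percent_growth(releases[63][0], releases[64][0])
nucleotide_growth = percent_growth(releases[63][1], releases[64][1])

print("Entries:     +%.1f%%" % entries_growth)
print("Nucleotides: +%.1f%%" % nucleotide_growth)
```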