EMBL Outstation - The European Bioinformatics Institute

EMBL Nucleotide Sequence Database
Release Notes
Release 64 Sep 2000

EMBL Outstation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone:       +44-1223-494400
Telefax:         +44-1223-494468
Electronic mail: datalib@ebi.ac.uk
URL:             http://www.ebi.ac.uk

CONTENTS

* 1 RELEASE 64
    o 1.1 Nine Billion Nucleotides
    o 1.2 Draft Human Genome
        o 1.2.1 Base Quality Values
        o 1.2.2 ENSEMBL automatic annotation
    o 1.3 Genomes Web Server
    o 1.4 Cross-Reference Information
    o 1.5 Database Files
        o 1.5.1 EST Database Files
        o 1.5.2 GSS Database Files
        o 1.5.3 HUM Database Files
        o 1.5.4 HTG Database Files
    o 1.6 Sequence Retrieval System (SRS6)
    o 1.7 EMBL Database FAQ
    o 1.8 Disclaimer
* 2 FORTHCOMING CHANGES
    o 2.1 Genome Representation
    o 2.2 New HTC (High Throughput cDNA) division
    o 2.3 EMBL Cumulative Update File
    o 2.4 Splitting HTG and GSS division files
    o 2.5 Next version of SRS indices
* 3 SEQUENCE SUBMISSION SYSTEMS
    o 3.1 Checking Sequence Data For Vector Contamination
    o 3.2 WebIn - WWW Sequence Submission System
    o 3.3 Bulk Submissions
    o 3.4 SEQUIN - Stand-alone Submission Program
    o 3.5 Sequence Alignment Submissions
    o 3.6 Further Submission Information
        o 3.6.1 Annotation Guides
* 4 CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE
* 5 EBI NETWORK SERVICES
    o 5.1 Electronic Mail Server
    o 5.2 Anonymous FTP Server
    o 5.3 World Wide Web (WWW) Server
    o 5.4 Sequence Similarity Search Servers
* 6 DISTRIBUTION FILES
    o 6.1 Release 64 Files
    o 6.2 SRS Indices
* APPENDIX A DATABASE GROWTH TABLE

1 RELEASE 64

The EMBL Nucleotide Sequence Database was frozen to make Release 64 on
02-Sep-2000. The release contains 8,344,436 sequence entries comprising
9,650,223,037 nucleotides. This represents an increase of about 16% over
Release 63.
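
The release-to-release increase can be checked against the totals listed for Releases 63 and 64 in Appendix A. The following is only a minimal sketch of that arithmetic; the function name percent_increase is illustrative and not part of any EBI tool. With these totals the nucleotide count grows by a little under 17% and the entry count by a little over 23%.

```python
# Totals copied from Appendix A (Database Growth Table) of these notes.
RELEASE_63 = {"entries": 6_760_113, "nucleotides": 8_255_674_441}
RELEASE_64 = {"entries": 8_344_436, "nucleotides": 9_650_223_037}


def percent_increase(old: int, new: int) -> float:
    """Return the relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


# Percentage growth per quantity between the two releases.
growth = {
    key: percent_increase(RELEASE_63[key], RELEASE_64[key])
    for key in RELEASE_64
}

if __name__ == "__main__":
    for key, value in growth.items():
        print(f"{key}: +{value:.1f}%")
```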
A breakdown of Release 64 by division is shown below:

Division                Entries      Nucleotides
-----------------  ------------  ---------------
ESTs                  5,565,880    2,194,418,599
Fungi                    41,017       75,333,934
GSSs                  1,717,212      950,099,606
HTG                      77,671    4,263,600,014
Human                   119,154      965,113,287
Invertebrates            54,900      329,846,226
Other Mammals            27,021       25,376,675
Organelles               72,962       61,665,029
Patents                 207,677       67,411,887
Bacteriophage             1,595        4,385,850
Plants                   68,956      221,131,770
Prokaryotes              86,977      218,928,626
Rodents                  55,263       92,528,729
STSs                    116,671       51,039,988
Synthetic                 3,838        9,763,762
Unclassified              1,174        1,869,994
Viruses                 102,523       90,011,114
Other Vertebrates        23,945       27,697,947
                   ------------  ---------------
Total                 8,344,436    9,650,223,037

1.1 Nine Billion Nucleotides

On 07-JUL-2000 the number of nucleotides in the EMBL Database passed the
9,000,000,000 mark. Over the last 12 months (compare Oct 1, 1999: 3.6
Gigabases) the database size has increased by more than 160%.

EMBL database statistics are available at URL:
http://www3.ebi.ac.uk/Services/DBStats/

1.2 Draft Human Genome and HTG division

The completion of the human draft genome sequence was announced on
26-June-2000. The draft sequence data are available from the EMBL Database
HTG and HUM divisions. The total size of the euchromatic portion of the
genome is estimated to be 3.2 Gbases. The combined total (FIN + UNFIN)
exceeds the size of the genome because of redundancy; the general
assumption is that about 30% - 40% of the bases are redundant.

Below are the database statistics for finished and unfinished human
sequence in the EMBL database as of September 19, 2000.

Date     FIN_TOTAL   UNFIN_TOTAL   FIN + UNFIN
------   ---------   -----------   -----------
9/2000      910 Mb       3505 Mb       4415 Mb

See also the Genome Monitoring Table for further detailed information,
available from the EBI at URL
http://www.ebi.ac.uk/Databases/Genome_MOT/genome_mot.html

1.2.1 Base quality values

Quality scores from draft HTG data are available on the EBI FTP server.
The gzip'ed files in the directory contain base quality values for
unfinished human sequences from Japanese, US and European sequencing
centres. The FastA-type headers contain the EMBL accession number/version
of the corresponding database entries.

Example:

  >AL009030.9 Phrap Quality (Length:229022, Min: 3, Max: 99)

In order to keep the size of the files within reasonable limits for
handling purposes, files which are bigger than 1 GB in uncompressed form
are split into smaller files.

Directory: ftp://ftp.ebi.ac.uk/pub/databases/embl/quality_scores

Current Files:

  /htg_sanger1.qscore.gz - /htg_sanger3.qscore.gz
  /htg_genoscope1.qscore.gz
  /htg_mpimg1.qscore.gz
  /htg_gbf1.qscore.gz
  /htg_japan1.qscore.gz
  /htg_us1.qscore.gz - /htg_us9.qscore.gz

Quality score files are updated on a daily basis.

1.2.2 Ensembl automatic annotation

Ensembl provides automatic annotation of the human draft genome data,
including information on confirmed peptides, confirmed cDNAs and predicted
peptides. Additionally, repeat prediction and the integration of map
information and SNPs are available. Updated human genome resources
spanning the entire working draft are now available. Ensembl has released
its automatic annotation for a June 15th "frozen" data set at
http://freeze.ensembl.org. This URL will now be the stable location for
all subsequent "frozen" dataset updates.

The Ensembl web site is available at URL http://www.ensembl.org/

Ensembl is a joint project between the Sanger Centre and EMBL-EBI.

1.3 Genome WEB Server

Access to completed genomes

The first completed genomes from viruses, phages and organelles were
deposited into the EMBL Database in the early 1980s. Since then, molecular
biology's shift towards obtaining the complete sequences of as many
genomes as possible, combined with major developments in sequencing
technology, has resulted in hundreds of complete genome sequences being
added to the database, including Archaea, Eubacteria and Eukaryota.
Recent additions include Buchnera sp.
APS (acc# BA000003) and Pseudomonas aeruginosa (acc# AE004091).

EBI's Genome Web Server provides easy access to completed genome sequences
and is available at URL: http://www.ebi.ac.uk/genomes/

Genome Monitoring Table

The Genome MOT presents the status of a number of large eukaryotic genome
sequencing projects. The tables are updated daily and also provide access
to EMBL database entries.

The Genome MOT is available at URL:
http://www.ebi.ac.uk/Databases/Genome_MOT/genome_mot.html

1.4 Cross-Reference Information

Links to a growing list of external databases have been expanded, allowing
integration with specialised data collections such as protein databases,
species-specific databases, taxonomy databases etc. The WWW-based sequence
retrieval system (SRS) enables users to easily navigate between
cross-referenced database entries.

EMBL links to other databases:

Database     Nr of links
----------   -----------
RZPD             2002574
TrEMBL            338688
Demeter           175252
SWISS-PROT        143124
MaizeDB            65929
FLYBASE            40968
IMGT/LIGM          37286
MENDEL             21033
GDB                 8430
MGD                 7998
TRANSFAC            6620
SGD                 6029
EPD                 3094
IMGT/HLA            2628
------------------------
Total            2859653

A list of URLs which conform with current DR line references is available:

Demeter     http://ars-genome.cornell.edu
EPD         http://www.epd.isb-sib.ch
FLYBASE     http://www.fruitfly.org
GDB         http://www.gdb.org
IMGT/HLA    http://www.ebi.ac.uk/imgt/hla
IMGT/LIGM   http://imgt.cines.fr:8104
MGD         http://www.informatics.jax.org
MaizeDB     http://www.agron.missouri.edu
MENDEL      http://mbclserver.rutgers.edu/CPGN
RZPD        http://www.rzpd.de
SGD         http://genome-www.stanford.edu
SWISS-PROT  http://www.expasy.ch
TRANSFAC    http://transfac.gbf.de/TRANSFAC
TrEMBL      http://www.ebi.ac.uk/swissprot/Information/information.html

1.5 Database Files

In order to keep the size of the data files within reasonable limits for
handling purposes, additional division files will be added in subsequent
releases as appropriate.
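
Because several divisions are distributed as numbered files (for example est_hum1.dat - est_hum23.dat and gss1.dat - gss18.dat in this release), a download or parsing script usually needs the full list of names for one division. The sketch below assumes the pattern <prefix><n>.dat with n counted from 1; the helper name expand_split_files is hypothetical, and the file counts must be taken from the notes of each release, since they change as divisions grow.

```python
def expand_split_files(prefix: str, count: int, suffix: str = ".dat") -> list[str]:
    """Return the names prefix1.dat .. prefix<count>.dat (hypothetical helper).

    Assumes the numbering starts at 1 and has no gaps, as in the
    Release 64 listing.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return [f"{prefix}{n}{suffix}" for n in range(1, count + 1)]


# File counts as stated for Release 64 in these notes.
human_est_files = expand_split_files("est_hum", 23)
gss_files = expand_split_files("gss", 18)

if __name__ == "__main__":
    print(human_est_files[0], "...", human_est_files[-1])
    print(len(gss_files), "GSS files")
```

Files that are not numbered in this release, such as est_fun.dat or htgo.dat, fall outside the pattern and would have to be listed explicitly.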
1.5.1 EST Database Files

EST files are now split according to taxonomic subdivisions, following the
model of the taxonomic split of all other EMBL database divisions. For
example, Release 64 includes the files:

  est_fun.dat                     Fungi ESTs
  est_hum1.dat - est_hum23.dat    Human ESTs
  est_inv1.dat - est_inv4.dat     Invertebrate ESTs
  est_mam1.dat - est_mam2.dat     Mammal ESTs
  est_pln1.dat - est_pln8.dat     Plant ESTs
  est_pro.dat                     Prokaryote ESTs
  est_rod1.dat - est_rod19.dat    Rodent ESTs
  est_vrt1.dat - est_vrt2.dat     Vertebrate ESTs

This should significantly reduce the volume of data users have to parse in
order to extract ESTs for specific groups of organisms.

1.5.2 GSS Database Files

The GSS division has been split into 18 files (gss1.dat-gss18.dat).

1.5.3 HUM Database Files

The HUM division has been split into 6 files (hum1.dat-hum6.dat).

1.5.4 HTG Database Files

The HTG division has been split into 11 files (htgo.dat and
htg1.dat-htg10.dat).

htgo.dat includes all HTGS_PHASE0 entries. These typically consist of
one-to-few pass reads of a single clone, have not been assembled into
contigs and are unoriented, unordered and unannotated, with runs of 'N's
marking the gaps that separate the reads. Low-pass sequence sampling is
useful for identifying clones that may be gene-rich. Phase0 sequences are
used to check whether another center is already sequencing a clone. If
not, the clone will be sequenced through phase 1 and phase 2. When records
are updated, the accession numbers will be preserved.

Files htg1-htg10 include all other HTG entries (HTGS_PHASE1 -
HTGS_PHASE2).

1.6 Sequence Retrieval System (SRS6)

As announced earlier, EBI's SRS6 server is available; the URL
http://srs.ebi.ac.uk/ now maps to http://srs6.ebi.ac.uk/. All external
services are available from the Tools button on EBI's Web pages.
If you have any comments and/or suggestions please send these to:
support@ebi.ac.uk

1.7 EMBL Database FAQ

An EMBL Database FAQ has been created and is available from the EBI at URL
http://www.ebi.ac.uk/embl/Documentation/FAQ/

This document includes information on:

  General questions about EMBL and other databases
  Submission procedure
  Updating database entries
  WEBIN-specific questions
  Navigation guide

1.8 Disclaimer

No guarantee is given as to the completeness and accuracy of the database
entries, in particular the conformity of sequence data in the database
with the journal publication where the sequence is also disclosed.

2 FORTHCOMING CHANGES

2.1 Genome Representation

At the May 2000 Collaborative Meeting the sequence database collaboration
DDBJ/EMBL/GenBank confirmed that the existing experimental FTP directory
representing genome data will be transformed into a database division CON
(Constructed Sequences), which will represent complete genomes and other
long sequences constructed from segment entries. The CON division entries
will contain the construct information (accession numbers and sequence
locations) involved in building the genomes. CON entries and associated
information will be included in the daily data exchange mechanism between
the collaborating databases.

The CON entry file includes construct information and all accession
numbers relevant to the genome. Additionally, the complete entry in EMBL
format (DNA and features) and the complete DNA sequence in Fasta format
are provided. These entries will be linked, searchable and retrievable
through SRS and available for BLAST and FASTA homology searching.
For an example representation, see the bacterial genome of Pseudomonas
aeruginosa (AE004091) in
ftp://ftp.ebi.ac.uk/pub/databases/embl/genomes/Bacteria/paeruginosa/

  AE004091.con
  AE004091.embl
  AE004091.embl.Z
  AE004091.fasta
  AE004091.fasta.Z

2.2 New HTC (High Throughput cDNA) division

At the May 2000 collaborative meeting DDBJ/EMBL/GenBank agreed to create a
new database division HTC to represent unfinished High Throughput cDNA
sequences. HTC sequences may include 5'UTR and 3'UTR regions and (part of
a) coding region. Once these sequences are finished, they will be moved to
the corresponding taxonomic division. HTC sequence entries will include
the keyword 'HTC'. The keyword will be removed once the entry has been
included in the taxonomic division.

2.3 EMBL cumulative update file

We intend to discontinue the provision of the single cumulative update
file. Several sites have reported problems handling our EMBL cumulative
update file when it grows beyond 2 GB (uncompressed), because their file
systems do not support files larger than 2 GB. Instead of the
cumulative.dat.gz file, we will continue to make available on our FTP
server a set of smaller data files, named cum_*.dat.gz, that together
contain the same data as the full cumulative update file.

For further details please check the README file in directory
ftp://ftp.ebi.ac.uk/pub/databases/embl/new/

2.4 Splitting HTG and GSS division files

We plan to split HTG and GSS division files according to taxonomic
subdivisions, following the model of the taxonomic split of all other EMBL
database divisions. This should significantly reduce the volume of data
users have to parse in order to extract HTGs and GSSs for specific groups
of organisms.

Files will be named accordingly, e.g. HTGS_PHASE0 sequences will be
included in files htgo_hum.dat, htgo_inv.dat, htgo_rod.dat etc, while
htgo.dat will include all remaining HTGS_PHASE0 entries.
HTGS_PHASE1 - HTGS_PHASE2 sequences will be included in files htg_hum.dat,
htg_inv.dat, htg_rod.dat etc, while htg.dat will include all remaining HTG
entries. GSS sequences will be included in files gss_fun.dat, gss_hum.dat
etc, while gss.dat will include all remaining GSS entries.

2.5 Next version of SRS indices

Please note that the next version of SRS indices will be for version 607x
and not 606.

3 SEQUENCE SUBMISSION SYSTEMS

3.1 Checking Sequence Data For Vector Contamination

We urge submitters to remove vector contamination from sequence data
before submitting to the database. To assist submitters, the EBI provides
a Vector Screening Service that uses the latest implementation of the
BLAST algorithm and a special sequence databank known as EMVEC. EMVEC is
an extraction of sequences from the SYNthetic division of EMBL containing
more than 2000 sequences commonly used in cloning and sequencing
experiments. EMVEC is by no means a complete vector databank, but EBI
believes it is representative of the kind of material used in modern
sequencing and should be useful to submitters. The databank will be
updated with each release of EMBL and made publicly available on the EBI's
ftp server for those who wish to have it.

The interactive WWW service can be found at:

  http://www.ebi.ac.uk/embl/Submission/webin.html
  http://www.ebi.ac.uk/blastall/vectors.html

The results will list sequences producing significant alignments and
associated information such as vector name, score, alignment etc.

3.2 WebIn - WWW Sequence Submission System

WebIn is the preferred WWW Sequence Submission System for submitting
nucleotide sequence data and associated biological information to the EMBL
Nucleotide Sequence Database at the European Bioinformatics Institute
(EBI).
To access WebIn at the EBI please use the following URL:
http://www.ebi.ac.uk/embl/Submission/webin.html

Database entries submitted to the EMBL Nucleotide Sequence Database at the
EBI will be exchanged and shared among the International Collaboration of
Nucleotide Sequence Databases (DDBJ/EMBL/GenBank).

WebIn guides the user through a sequence of WWW forms allowing the
submission of sequence data and descriptive information in an interactive
and easy way. All the information required to create a database entry will
be collected during this process:

  1 Submitter Information
  2 Release Date Information
  3 Sequence Data, Description and Source Information
  4 Reference Citation Information
  5 Feature Information (e.g. coding regions, regulators, signals etc.)

EBI staff will process data submissions within 2 working days and send the
database accession number(s) assigned to your data to your e-mail address.

3.3 Bulk Submissions

To make bulk sequence submission less time consuming for submitters, a new
web-based submission system can now be accessed from the WebIn page.
Authors planning to submit a large number of similar sequences (i.e. more
than 25) are presented with an option for "Bulk WebIn Submission". When
choosing the bulk path, submitters carry on with the usual WebIn
submission procedure until they have finished a first, single
representative sequence. During the submission process database staff will
interactively assist in making the submission of this specific data as
convenient as possible, thus saving the author the time and effort
required to complete numerous submission events individually.

Alternatively, authors planning to submit very large numbers of similar
sequences should contact the database before submitting the data. Database
staff will create a series of templates and communicate these to the
author, who then completes them with just the information unique to each
sequence.

Please contact database staff if you require further information.
  e-mail: datasubs@ebi.ac.uk
  Tel:    +44-1223-494499
  Fax:    +44-1223-494472

3.4 SEQUIN - Stand-alone Submission Program

Sequin is the multi-platform (Mac/PC/Unix) stand-alone software tool
developed by the NCBI for submitting entries to the EMBL, GenBank, or DDBJ
sequence databases. The Sequin program, detailed downloading and
installation instructions and general information are available from the
EBI via WWW and anonymous FTP.

  http://www3.ebi.ac.uk/Services/Sequin/
  ftp://ftp.ebi.ac.uk/pub/software/sequin/

3.5 Sequence Alignment Submissions

The EBI accepts submissions of alignment data (e.g. from phylogenetic and
population analysis etc) of both nucleotide and amino-acid sequences.
Database staff assign an alignment number (e.g. ds38200), which is then
communicated to the submitter. We suggest that this number is quoted in
the resulting publication. Alignment data and associated information are
made available via EBI's network servers (see below).

ALIGNMENT FORMATS:

As well as your alignment data, we require information describing your
alignment (see table below). Please provide information for all fields.

Description Field          Information required

TITLE:                     Title of alignment
SUBMITTER:                 Name, Affiliation, Phone, Fax, Email
RELEASE DATE:              Public Immediately / if Confidential please
                           provide hold date
CITATION:                  If known please provide complete Author list,
                           Title, Journal, Year of publication, Page
                           numbers
ALIGNMENT METHOD:          Method of alignment and format submitted,
                           parameters of alignment sequences used (if
                           appropriate)
DESCRIPTION OF SYMBOLS:    e.g. Gaps indicated by a dash '-'
DESCRIPTION OF ALIGNMENT:  Describe sequences aligned, including accession
                           numbers (if known) and abbreviation of clones
                           or taxon used in alignment file. If your
                           alignment contains sequences derived from
                           multiple taxonomic sources, please provide the
                           full name of each organism
FILE FORMAT:               We suggest submission in STANDARD ALIGNMENT
                           FORMATS (e.g. NEXUS, PHYLIP, CLUSTALW etc) or
                           Sequin output.
A sample alignment in NEXUS format can be viewed at
ftp://ftp.ebi.ac.uk/pub/databases/embl/align/ds32096.dat

NOTE 1: Alignments can be created within Sequin or imported into Sequin
from files in a standard alignment format like NEXUS or PHYLIP.

NOTE 2: If reporting new primary sequence data, we suggest that you submit
the complete individual sequence files (e.g. via Sequin or WebIn), in
order to include the sequence data as individual entries in the EMBL
database. If gaps have been introduced for the alignment, please leave
them out when sending the individual sequence files.

SENDING ALIGNMENT DATA to the EMBL Nucleotide Sequence Database

Sequence alignment data can be sent to the Nucleotide Sequence Database by
electronic mail to datasubs@ebi.ac.uk

ACCESSING ALIGNMENT DATA

Alignment data and additional information are available via the EBI
servers:

EBI WWW server:   http://www.ebi.ac.uk/embl/Submission/alignment.html
                  ftp://ftp.ebi.ac.uk/pub/databases/embl/align/
EBI FTP server:   by anonymous FTP from FTP.EBI.AC.UK in directory
                  pub/databases/embl/align
EBI File server:  by sending an e-mail message to netserv@ebi.ac.uk
                  including the line HELP ALIGN or GET ALIGN:DS8200.DAT

3.6 Further Submission Information

3.6.1 Annotation Guides

To help and guide submitters in annotating their sequences, two online
guides are available via hyperlinks from within WebIn: EMBL Annotation
Examples (http://www3.ebi.ac.uk/Services/Standards/web/) and EMBL Features
and Qualifiers (http://www3.ebi.ac.uk/Services/WebFeat/). The annotation
examples consist of a list of EMBL approved feature table annotations for
common biological sequences. The EMBL Features and Qualifiers is a
complete list of feature table key and qualifier definitions providing
detailed descriptions, mandatory and optional qualifiers and usage
examples.
For further information on submission of sequence data to the EMBL
Nucleotide Sequence Database please access:
http://www.ebi.ac.uk/embl/Submission/
or contact database staff at:

  EMBL Nucleotide Sequence Submissions
  e-mail:    datasubs@ebi.ac.uk
  telephone: +44-1223-494499
  telefax:   +44-1223-494472

4 CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE

We encourage authors to include a reference to the EMBL Database in
publications related to their research. When citing data in the EMBL
Database, we suggest giving the corresponding primary accession number and
the publication in which the sequence first appeared. For unpublished
data, we suggest contacting the original submitters for recent publication
information or revisions of the data. We also suggest providing a
reference for the EMBL Database itself. Our recent publication describing
the EMBL database should be cited:

  Baker W., van den Broek A., Camon E., Hingamp P., Sterk P., Stoesser G.,
  and Tuli M.A. 'The EMBL Nucleotide Sequence Database', Nucl. Acids Res.,
  28 (1), 19-23 (2000).

Example: The numbers in parentheses refer to the REFERENCE in the EMBL
database entry, and to the EMBL citation above.

  "Sequence entry X56734 (1) has been retrieved from the EMBL Database (2)
  and showed significant sequence similarity to ..."

  (1) Oxtoby, E., et al., Plant Mol. Biol. 17:209-219(1991).
  (2) Baker, W., et al., Nucl. Acids Res. 28:19-23(2000)

5 EBI NETWORK SERVICES

5.1 Electronic Mail Server

Computer users with access to the Internet (directly or via a gateway) can
obtain copies of database entries, documentation or the data submission
form by sending commands to a file server running at EBI. New and updated
EMBL nucleotide sequence entries are made available on the server on a
daily basis. To use this facility, send file server commands (as
electronic mail) to the address netserv@ebi.ac.uk. Each line of the mail
message should consist of a single file server request.
The most important file server request, to get started, is:

  HELP

If the file server receives this command, it will return a help file to
the sender, explaining in some detail how to use the facility. For
example, to request a copy of the nucleotide sequence with accession
number X55652, use the command:

  GET NUC:X55652

The file server offers various other services (e.g. access to nucleotide
and protein sequence data, protein structure data, software), details of
which are provided in the HELP file.

5.2 Anonymous FTP Server

An alternative method of accessing the EBI archives is to use the Internet
File Transfer Protocol (FTP). Researchers with direct access to the
Internet can use the FTP program on their local machine to connect to the
host FTP.EBI.AC.UK and enter the username "anonymous" and their email
address as password. The directory pub/help contains detailed information
about the data available from the EBI anonymous FTP server, which includes
the complete EMBL Nucleotide Sequence Database releases as well as daily
and weekly updates and a cumulative update file (in UNIX-compressed
format) in the following directories:

  EMBL quarterly release: pub/databases/embl/release
  EMBL updates:           pub/databases/embl/new

5.3 World Wide Web (WWW) Server

The EBI operates a WWW server with URL http://www.ebi.ac.uk/ which gives
access to information about the EBI and its products and services.
Nucleotide sequences can be retrieved by a simple query by accession
number, or more complex queries can be constructed using an SRS WWW
databank browser.

Nucleotide sequences can also be submitted to the database using the
interactive submission system WebIn at URL:
http://www.ebi.ac.uk/embl/Submission/webin.html

5.4 Sequence Similarity Search Servers

The EBI offers the following network servers for sequence similarity
searches via electronic mail or interactive WWW forms:

FASTA   based on W. Pearson's FASTA algorithm. Allows local similarity
        searches of protein and nucleotide sequence databases.
        Send "help" to Fasta@EBI.AC.UK or use URL
        http://www.ebi.ac.uk/fasta3/

BLAST   based on the NCBI and WU-Blast software.
        Send "help" to Blast@EBI.AC.UK or use URL
        http://www.ebi.ac.uk/blast2/

BLITZ   allows very fast searches of protein sequence databases for local
        similarities using an exhaustive Smith-Waterman matching
        algorithm. Compugen's BIC_SW software is running on a
        Biocellerator (BIC-2).
        Send "help" to Blitz@EBI.AC.UK or use URL
        http://www.ebi.ac.uk/bic_sw/

6 DISTRIBUTION FILES

6.1 Release 64 Files

The release contains the files shown below, in the order listed. File
sizes are given as numbers of records.

File    File Name      Description                        Number of
Number                                                      Records
------  -------------  --------------------------------  ----------
     1  DELETEAC.TXT   Deleted accession numbers              44649
     2  FTABLE.TXT     Feature Table Documentation              465
     3  RELNOTES.TXT   Release Notes (this document)            915
     4  SUBFORM.TXT    Data Submission Form                     418
     5  SUBINFO.TXT    Data Submission Documentation            333
     6  UPDATE.TXT     Data Update Form                         107
     7  USRMAN.TXT     User Manual                             1469
     8  ACNUMBER.NDX   Accession Number Index               8372365
     9  CITATION.NDX   Citation Index                       1872434
    10  DIVISION.NDX   Division Index                            23
    11  KEYWORD.NDX    Keyword Index                        3109242
    12  SHORTDIR.NDX   Short Directory Index               21428207
    13  SPECIES.NDX    Species Index                        2888410
    14  EST_FUN.DAT    EST Sequences                        3491596
    15  EST_HUM1.DAT   EST Sequences                        7242162
    16  EST_HUM2.DAT   EST Sequences                        7383411
    17  EST_HUM3.DAT   EST Sequences                        7092087
    18  EST_HUM4.DAT   EST Sequences                        6958043
    19  EST_HUM5.DAT   EST Sequences                        7086795
    20  EST_HUM6.DAT   EST Sequences                        7098043
    21  EST_HUM7.DAT   EST Sequences                        7136249
    22  EST_HUM8.DAT   EST Sequences                        7031857
    23  EST_HUM9.DAT   EST Sequences                        7156374
    24  EST_HUM10.DAT  EST Sequences                        6859020
    25  EST_HUM11.DAT  EST Sequences                        6661083
    26  EST_HUM12.DAT  EST Sequences                        6431484
    27  EST_HUM13.DAT  EST Sequences                        6811351
    28  EST_HUM14.DAT  EST Sequences                        6856402
    29  EST_HUM15.DAT  EST Sequences                        7036586
    30  EST_HUM16.DAT  EST Sequences                        7306475
    31  EST_HUM17.DAT  EST Sequences                        7263236
    32  EST_HUM18.DAT  EST Sequences                        7357458
    33  EST_HUM19.DAT  EST Sequences                        7444208
    34  EST_HUM20.DAT  EST Sequences                        7476190
    35  EST_HUM21.DAT  EST Sequences                        6699624
    36  EST_HUM22.DAT  EST Sequences                        6963358
    37  EST_HUM23.DAT  EST Sequences                        4588499
    38  EST_INV1.DAT   EST Sequences                        6431773
    39  EST_INV2.DAT   EST Sequences                        6042873
    40  EST_INV3.DAT   EST Sequences                        6293598
    41  EST_INV4.DAT   EST Sequences                        4046341
    42  EST_MAM1.DAT   EST Sequences                        6114230
    43  EST_MAM2.DAT   EST Sequences                        2356039
    44  EST_PLN1.DAT   EST Sequences                        6750911
    45  EST_PLN2.DAT   EST Sequences                        6219344
    46  EST_PLN3.DAT   EST Sequences                        5830564
    47  EST_PLN4.DAT   EST Sequences                        7215994
    48  EST_PLN5.DAT   EST Sequences                        7046836
    49  EST_PLN6.DAT   EST Sequences                        6762278
    50  EST_PLN7.DAT   EST Sequences                        6720107
    51  EST_PLN8.DAT   EST Sequences                        6029205
    52  EST_PRO.DAT    EST Sequences                          38548
    53  EST_ROD1.DAT   EST Sequences                        7331559
    54  EST_ROD2.DAT   EST Sequences                        7567611
    55  EST_ROD3.DAT   EST Sequences                        7220551
    56  EST_ROD4.DAT   EST Sequences                        7549688
    57  EST_ROD5.DAT   EST Sequences                        6811012
    58  EST_ROD6.DAT   EST Sequences                        7086810
    59  EST_ROD7.DAT   EST Sequences                        9771985
    60  EST_ROD8.DAT   EST Sequences                        9130283
    61  EST_ROD9.DAT   EST Sequences                        7665029
    62  EST_ROD10.DAT  EST Sequences                        9177208
    63  EST_ROD11.DAT  EST Sequences                        9743196
    64  EST_ROD12.DAT  EST Sequences                        9700691
    65  EST_ROD13.DAT  EST Sequences                        9653685
    66  EST_ROD14.DAT  EST Sequences                        9473210
    67  EST_ROD15.DAT  EST Sequences                        9015774
    68  EST_ROD16.DAT  EST Sequences                        6666497
    69  EST_ROD17.DAT  EST Sequences                        7649778
    70  EST_ROD18.DAT  EST Sequences                        7420422
    71  EST_ROD19.DAT  EST Sequences                         738690
    72  EST_VRT1.DAT   EST Sequences                        7641169
    73  EST_VRT2.DAT   EST Sequences                        2254064
    74  FUN.DAT        Fungi Sequences                      3736027
    75  GSS1.DAT       Genome Survey Sequences              6116578
    76  GSS2.DAT       Genome Survey Sequences              6118824
    77  GSS3.DAT       Genome Survey Sequences              6268149
    78  GSS4.DAT       Genome Survey Sequences              6628318
    79  GSS5.DAT       Genome Survey Sequences              6554451
    80  GSS6.DAT       Genome Survey Sequences              6616068
    81  GSS7.DAT       Genome Survey Sequences              6639716
    82  GSS8.DAT       Genome Survey Sequences              6644800
    83  GSS9.DAT       Genome Survey Sequences              6958158
    84  GSS10.DAT      Genome Survey Sequences              6788195
    85  GSS11.DAT      Genome Survey Sequences              7155659
    86  GSS12.DAT      Genome Survey Sequences              6988978
    87  GSS13.DAT      Genome Survey Sequences              6978243
    88  GSS14.DAT      Genome Survey Sequences              6402203
    89  GSS15.DAT      Genome Survey Sequences              6646868
    90  GSS16.DAT      Genome Survey Sequences              7448747
    91  GSS17.DAT      Genome Survey Sequences              6669805
    92  GSS18.DAT      Genome Survey Sequences              1027489
    93  HTG1.DAT       High Throughput Genome Sequences     7854248
    94  HTG2.DAT       High Throughput Genome Sequences     5995734
    95  HTG3.DAT       High Throughput Genome Sequences     4210260
    96  HTG4.DAT       High Throughput Genome Sequences     4724917
    97  HTG5.DAT       High Throughput Genome Sequences     8718298
    98  HTG6.DAT       High Throughput Genome Sequences     8721834
    99  HTG7.DAT       High Throughput Genome Sequences     8979368
   100  HTG8.DAT       High Throughput Genome Sequences     8137472
   101  HTG9.DAT       High Throughput Genome Sequences     7846179
   102  HTG10.DAT      High Throughput Genome Sequences     4273070
   103  HTGO.DAT       High Throughput Genome Sequences     8701440
   104  HUM1.DAT       Human Sequences                      9494007
   105  HUM2.DAT       Human Sequences                      5320579
   106  HUM3.DAT       Human Sequences                      3561983
   107  HUM4.DAT       Human Sequences                      2858503
   108  HUM5.DAT       Human Sequences                      2298449
   109  HUM6.DAT       Human Sequences                      1644433
   110  INV.DAT        Invertebrate Sequences               9495348
   111  MAM.DAT        Other Mammal Sequences               1908267
   112  ORG.DAT        Organelle Sequences                  5140625
   113  PATENT.DAT     Patent Sequences                     8110279
   114  PHG.DAT        Bacteriophage Sequences               217840
   115  PLN.DAT        Plant Sequences                      8269953
   116  PRO1.DAT       Prokaryote Sequences                 6104496
   117  PRO2.DAT       Prokaryote Sequences                 4233076
   118  ROD.DAT        Rodent Sequences                     4755562
   119  STS.DAT        STS Sequences                        7970081
   120  SYN.DAT        Synthetic Sequences                   394629
   121  UNC.DAT        Unclassified Sequences                106371
   122  VRL.DAT        Viral Sequences                      7545287
   123  VRT.DAT        Other Vertebrate Sequences           1787491

6.2 SRS Indices

SRS indices can be found on the FTP server in the srs directory
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/srs/. See README file for
details. Please note that the next version of SRS indices will be for
version 607x and not 606.
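
Each row of the file listing in section 6.1 has the same layout (file number, file name, description, number of records), so it can be read with a small script, for instance to total the records of a division that is split over several files. The sketch below is illustrative only: the regular expression, the function names and the three sample rows (copied from the listing) are assumptions about how one might process the table, not an EBI-provided tool.

```python
import re
from collections import defaultdict

# One listing row: file number, file name, free-text description, record count.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.+?)\s+(\d+)\s*$")

# Three rows copied from the Release 64 file listing.
SAMPLE = """\
116 PRO1.DAT Prokaryote Sequences 6104496
117 PRO2.DAT Prokaryote Sequences 4233076
120 SYN.DAT Synthetic Sequences 394629
"""


def parse_rows(text):
    """Return (number, name, description, records) for each matching line."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            number, name, description, records = match.groups()
            rows.append((int(number), name, description, int(records)))
    return rows


def records_by_division(rows):
    """Sum record counts per division, ignoring the numeric file suffix."""
    totals = defaultdict(int)
    for _, name, _, records in rows:
        stem = name.rsplit(".", 1)[0]          # PRO1.DAT -> PRO1
        division = stem.rstrip("0123456789")   # PRO1 -> PRO
        totals[division] += records
    return dict(totals)


totals = records_by_division(parse_rows(SAMPLE))

if __name__ == "__main__":
    for division, records in sorted(totals.items()):
        print(division, records)
```

Lines that do not match the row pattern, such as headings or blank lines, are skipped silently, which is convenient for a quick check but would hide malformed rows in a stricter tool.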
APPENDIX A DATABASE GROWTH TABLE

The following table shows the growth of the EMBL Nucleotide Sequence
Database at each release.

Release  Month      Entries   Nucleotides
      1  06/1982        568        585433
      2  04/1983        811       1114447
      3  12/1983       1481       1654863
      4  08/1984       1698       2147205
      5  04/1985       2378       2874493
      6  08/1985       4835       4567592
      7  12/1985       5789       5622638
      8  04/1986       6395       6353040
      9  09/1986       7630       7813214
     10  12/1986       8817       9766948
     11  04/1987      11621      12189783
     12  07/1987      12706      13638061
     13  10/1987      14397      16023478
     14  01/1988      15344      17272160
     15  05/1988      17961      20318442
     16  08/1988      19592      22625941
     17  11/1988      20695      24211054
     18  02/1989      22938      27249830
     19  05/1989      24365      29066676
     20  08/1989      26223      31240948
     21  11/1989      28679      34748087
     22  02/1990      31508      38165786
     23  05/1990      34902      42923803
     24  08/1990      37784      47354438
     25  11/1990      41580      52900354
     26  02/1991      43745      55859549
     27  05/1991      46871      59915244
     28  09/1991      54558      70448052
     29  12/1991      57655      75400487
     30  03/1992      63378      83574342
     31  06/1992      72481      94390065
     32  09/1992      79377     101292310
     33  12/1992      89100     111413979
     34  03/1993      99591     121420828
     35  06/1993     108973     131880111
     36  09/1993     127933     145401156
     37  12/1993     146576     158171400
     38  03/1994     167777     177550115
     39  06/1994     182615     192195819
     40  09/1994     209352     211017104
     41  12/1994     230950     226259607
     42  03/1995     303206     262559786
     43  06/1995     420111     315840053
     44  09/1995     506190     363273777
     45  12/1995     622566     427620278
     46  03/1996     701246     473691480
     47  06/1996     827174     550739395
     48  09/1996     928067     608931850
     49  12/1996    1047263     696183789
     50  03/1997    1187455     789755858
     51  06/1997    1432941     931351601
     52  10/1997    1787004    1181167498
     53  12/1997    1917868    1281391651
     54  03/1998    2125225    1427634373
     55  06/1998    2330040    1607673907
     56  09/1998    2689618    1904091473
     57  12/1998    3046471    2164718256
     58  03/1999    3272064    2355200790
     59  06/1999    3952878    2924568545
     60  09/1999    4719266    3543553093
     61  12/1999    5303436    4508169737
     62  03/2000    5865742    6120908677
     63  06/2000    6760113    8255674441
     64  09/2000    8344436    9650223037
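
The growth table lends itself to simple rate calculations. As a minimal sketch, the code below takes the last five rows (Releases 60 to 64), computes the percentage increase in nucleotides for each release, and derives a rough doubling time. The doubling time assumes smooth exponential growth over the twelve months between Release 60 and Release 64, which is a simplification; the variable and function names are illustrative.

```python
import math

# (release, month, entries, nucleotides) copied from the last rows of the table.
GROWTH = [
    (60, "09/1999", 4_719_266, 3_543_553_093),
    (61, "12/1999", 5_303_436, 4_508_169_737),
    (62, "03/2000", 5_865_742, 6_120_908_677),
    (63, "06/2000", 6_760_113, 8_255_674_441),
    (64, "09/2000", 8_344_436, 9_650_223_037),
]


def release_growth(rows):
    """Return (release, percent increase in nucleotides over the previous row)."""
    result = []
    for previous, current in zip(rows, rows[1:]):
        percent = (current[3] - previous[3]) / previous[3] * 100.0
        result.append((current[0], percent))
    return result


# Growth factor over the 12 months from Release 60 to Release 64.
yearly_factor = GROWTH[-1][3] / GROWTH[0][3]

# Doubling time in months, assuming constant exponential growth (an assumption).
doubling_months = 12 * math.log(2) / math.log(yearly_factor)

if __name__ == "__main__":
    for release, percent in release_growth(GROWTH):
        print(f"Release {release}: +{percent:.1f}% nucleotides")
    print(f"12-month growth factor: {yearly_factor:.2f}")
    print(f"Approximate doubling time: {doubling_months:.1f} months")
```

Under these assumptions the nucleotide count grew by a factor of about 2.7 in twelve months, which corresponds to a doubling time of a little over eight months.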