Finding sequences with Entrez at NCBI.

Note: this page was last updated 9/21/06.  NCBI constantly revises the features on its pages, so specific information given below may no longer apply at a later date.

Databases at NCBI

Three major bioinformatics centers in the world accumulate protein and nucleic acid entries from a variety of sources and mutually exchange them to create a common database system.  These are NCBI (United States), EMBL (Europe), and DDBJ (Japan).

The general retrieval tool at NCBI is called Entrez, a general user interface featuring a complex query language.  The query language retrieves sets of sequences based on a variety of descriptor fields (such as source organism, date, keywords, and accession numbers) attached to the individual entries.  The main Entrez interface page at NCBI has selections for searching a variety of databases, ranging from protein sequences to literature to SNPs.  Each database has its own set of allowable descriptors.  This document is confined to the protein and nucleotide databases, which have essentially the same descriptor options.
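As a rough illustration of how descriptor fields combine into one query, the sketch below builds an Entrez search term and wraps it in a search URL.  It assumes the bracketed field-tag syntax (for example [Organism] and [Title]) and NCBI's E-utilities esearch address, neither of which is spelled out on this page; the helper names build_term and build_esearch_url are hypothetical.  Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities esearch endpoint, NCBI's programmatic
# counterpart to the Entrez web form.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(fields):
    """Join (value, field) pairs into one Entrez query string.

    Each pair becomes value[Field]; pairs are combined with AND.
    """
    parts = []
    for value, field in fields:
        # Quote multi-word values so they are searched as a phrase.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{value}[{field}]")
    return " AND ".join(parts)


def build_esearch_url(db, term, retmax=20):
    """Return a URL that would run the query against the given database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical example query: proteins from one organism with a word in the title.
term = build_term([("Escherichia coli", "Organism"), ("lysozyme", "Title")])
url = build_esearch_url("protein", term)
print(term)
print(url)
```

The same term string can be pasted directly into the Entrez search box; only the URL wrapper is specific to scripted use.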

Alternative retrieval schemes.

You can locate a sequence of interest by BLAST or PSI-BLAST searching and then follow the link to the sequence entry.  At the bottom of the NCBI BLAST page are special BLAST searches for specific genomes and for some special sequence collections.

The EMBL and DDBJ systems have a form-based retrieval system called SRS.

The local implementation of GCG has a command named lookup, which is modeled on SRS, but its indexes are badly out of date.  You should not use GCG for direct sequence retrieval or for searching its built-in databases.

The site-licensed Vector NTI program directly opens a browser window to NCBI Entrez.
The site-licensed Lasergene program uses a form to create a query that is sent to NCBI Entrez.  This system failed a test of a simple gi number retrieval.

The gi number problem:

Originally, database entries were given accession numbers (which always start with letters).  Eventually it was discovered that some accession numbers had been used for more than one sequence, so a new numbering system, called "GenInfo identifiers" or gi numbers, was created to give each entry a unique identifier.  For reasons that are unclear, gi numbers are not among the descriptors with a named field defined for the protein and nucleic acid databases at NCBI.  Hence requests to retrieve by gi number are special-cased, which leads to discrepancies in the way that various Entrez interfaces behave.
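Because the two kinds of identifier are handled differently, a script that accepts either one may want to tell them apart before building a query.  The sketch below is one way to do that, under simplifying assumptions: a gi number is taken to be digits only, and an accession is taken to be letters, an optional underscore, digits, and an optional ".version" suffix.  Other accession styles are not covered, and the function name classify_identifier is hypothetical.

```python
import re

# Assumed patterns (simplified): gi numbers are purely numeric; accessions
# start with letters and may carry a ".N" version suffix.
GI_RE = re.compile(r"^\d+$")
ACC_RE = re.compile(r"^([A-Za-z]+_?\d+)(?:\.(\d+))?$")


def classify_identifier(text):
    """Return (kind, identifier, version) for a gi number or accession."""
    text = text.strip()
    if GI_RE.match(text):
        return ("gi", text, None)
    match = ACC_RE.match(text)
    if match:
        version = int(match.group(2)) if match.group(2) else None
        return ("accession", match.group(1).upper(), version)
    return ("unknown", text, None)


# Made-up identifiers, used only to exercise each branch.
for example in ["1234567", "AB123456.2", "xy_000123", "not an id"]:
    print(example, "->", classify_identifier(example))
```

Anything reported as "unknown" is better passed through to Entrez unchanged than guessed at.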

Extra help on the main Entrez page.

The query language:

Accession numbers and version numbers.

Giving accession numbers or gi numbers to BLAST.

Redundancy and looking for more documentation about a particular sequence.

Saving a search to repeat at a later time

Using Entrez queries to clean up a BLAST alignment

Precomputed BLAST results:


Carving small pieces of genomic sequences from large nucleotide entries:

Downloading various formats from Entrez:

Genome sequences are sometimes not completely assembled.



Last updated 09/21/2006; Steve Hardies