Finding sequences with Entrez at NCBI.
Note: this page was updated 9/21/06. NCBI constantly revises the features
on its pages, so specific information given below may no longer apply
at a later date.
Databases at NCBI
Three major bioinformatics centers in the world accumulate protein and
nucleic acid entries from a variety of sources and mutually exchange them
to create a common database system. These are:
-
NCBI (http://www.ncbi.nlm.nih.gov/)
-
EMBL databases at EBI (http://www.ebi.ac.uk/)
-
DDBJ (http://www.ddbj.nig.ac.jp/)
The general retrieval tool at NCBI is called Entrez.
Entrez is a general user interface featuring a complex query language.
The query language allows retrieving sets of sequences based on a variety
of descriptor fields (such as source organism, date, keywords, accession
numbers, etc.) attached to the individual entries. The main Entrez
interface page at NCBI has selections for searching a variety of databases
ranging from protein sequences to literature to SNPs. Each database has
its own set of allowable descriptors assigned to it. This document
is confined to the protein and nucleotide databases, which have essentially
the same descriptor options.
Alternative retrieval schemes.
You can locate a sequence of interest by Blast or Psi-Blast searching and
just follow the link to the sequence entry. At the bottom of the NCBI
Blast page are special Blast searches for specific genomes or for
some special sequence collections.
The EMBL and DDBJ systems have a form-based retrieval system called
SRS.
The local implementation of GCG has a command named lookup which is
modeled on SRS, but its indexes are badly out of date. You should
not use GCG for direct sequence retrieval or for searching its intrinsic
databases.
The site-licensed Vector NTI program directly opens a browser window
to NCBI Entrez.
The site-licensed Lasergene program uses a form to create a query that
is sent to NCBI Entrez. This system failed a test of a simple gi
number retrieval.
The gi number problem:
Originally database entries were given accession numbers (which always
start with letters). Eventually it was discovered that some accession
numbers had been used for more than one sequence. So a new numbering
system was created, called "GenInfo Identifiers" or gi numbers, in order to
give each entry a unique identifier. For reasons that are unclear,
gi numbers are not one of the descriptors with a named field defined for
the protein and nucleic acid databases at NCBI. Hence requests
to retrieve gi numbers are special-cased, which leads to discrepancies in
the way that various Entrez interfaces behave.
-
Retrieve gi numbers as pure space-delimited numbers: If pure
numbers are typed in the NCBI Entrez query box separated by spaces, they
are interpreted as gi numbers and each of the entries is retrieved.
The spaces are therefore interpreted as logical OR. Requests for
gi numbers cannot be mixed with other elements of the query language,
or modified by the "limits" menu. Sets retrieved this way can be
combined with other sets after the fact using AND, OR, or NOT (see below).
-
Retrieve anything else using the query language: If there are any
letters or punctuation typed in the box, then the numbers are interpreted
as text within the context of the query language described below.
gi numbers entered in that way won't match anything. You cannot
retrieve a mixture of gi numbers and accession numbers by directly typing
them in the Entrez query box.
-
The "limit by Entrez query" box in Blast follows the same rules as the
Entrez query box, except set numbers (see below) are not allowed.
-
Batch Entrez (Found under Blast Tools) follows a completely different
set of rules:
-
Find Batch Entrez in the frame at the left of the Entrez page.
-
For Batch Entrez, the query is in a text file that you upload.
-
The query can be a mixture of gi numbers, accession numbers, and keywords.
-
The items are separated by spaces or line breaks.
-
Logical operators and field codes are not recognized; the logical operator
corresponding to a space is assumed to be OR.
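The two rules for the Entrez query box can be sketched as a small classifier. This only illustrates the behavior described above and is not NCBI's actual parser; the function name interpret_query is hypothetical.

```python
import re

def interpret_query(text):
    """Guess how the Entrez query box would treat `text`, per the rules above.

    Returns ("gi_list", [numbers]) when the box holds only digits and
    whitespace, otherwise ("query_language", text).
    """
    tokens = text.split()
    if tokens and all(re.fullmatch(r"[0-9]+", t) for t in tokens):
        # Pure numbers: each is taken as a gi number, spaces act as OR.
        return ("gi_list", [int(t) for t in tokens])
    # Any letter or punctuation switches to the query language, where
    # bare numbers are treated as text and will not match as gi numbers.
    return ("query_language", text)

print(interpret_query("12345 67890"))
print(interpret_query("12345 OR P12345"))
```

Running a check like this before pasting a list into the query box shows whether a stray accession number would turn the whole list into a text query.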
Extra help on the main Entrez page.
-
Although you can search a database from the NCBI top page, you won't see
the more powerful Entrez tools available. Always go to the Entrez
page and then pick your database. There are much better tools there
for defining your search, as well as links to documentation. You
can access an Entrez page by the "all databases" toolbar item. Conducting
any search from the main page returns the corresponding Entrez search page.
The query language:
-
By default, all fields are searched.
-
One can limit the search to specific fields (such as accession number,
organism, keyword, date, or length) by activating the "limits" menu.
-
One can also limit the search by appending a field code to the word in
the query box, e.g. human[orgn] retrieves only sequences from humans.
You can use the taxonomy browser to find the correct terms to use with
the [orgn] code to limit searches to various groups in the phylogenetic
hierarchy. If the term has multiple words, surround it with quotes,
e.g. "Escherichia coli"[orgn]
-
Words separated by spaces are assumed to be connected by AND (the
exception being the pure lists of gi numbers described above). You
can explicitly construct queries with AND, OR, and NOT, and use
parentheses to group terms. Logical operators must be
capitalized.
-
Multiword searches return any entry with those words in the documentation,
but not necessarily near each other. For example, searching the nr
nucleotide database for the fictitious enzyme maltose dehydrogenase returns
lots of completely sequenced genomes for you to thrash around in trying
to find this nonexistent gene.
-
Putting "maltose dehydrogenase" in quotes causes a search for the words
adjacent to each other. Note that if the phrase is not found, there
will be a message to that effect, followed by the long list of nonsense
finds.
-
Complex queries can be built, e.g. kinase AND (primate[orgn] NOT human[orgn])
returns non-human primate kinases.
-
Complex queries can be put in the Blast "limit by Entrez query" boxes.
-
Of course, you have no reason to believe that every kinase has been labeled
with the word "kinase". Also, many sequences that are not kinases
will be returned because the word "kinase" appears in the title of a cited
paper.
-
There are two menu driven ways to construct complex queries:
-
Conduct individual searches and then combine them
-
Go to the "history" menu and see that your previous searches are listed
by number
-
Combine them by typing constructs like #1 AND #2, or #1 OR #2, or
#1 NOT #2 in the query box.
-
You may also mix the previously defined sets with explicit syntax, such
as (#1 OR #2) AND human[orgn]
-
The set numbers may not be used for the Blast "limit by Entrez query" option.
-
You can use the tools on the "preview/index" page to add clauses to an
existing query.
-
There are two ways to try to find the appropriate field codes for a complex
query:
-
Try to find it in the help documentation attached to the Entrez page.
-
Do a menu driven search and then click the "details" menu. It will
display the equivalent search written out with field codes. For example,
to narrow the proteins found in a prior search #2 to those 396 residues
long, I selected "sequence length" from the limits box and entered #2 AND
396:396 in the search box. After conducting the search, the details
button reveals that the full syntax was #2 AND ( 396 [Sequence Length]
: 396 [Sequence Length]).
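If you build such queries often, the syntax above can be assembled by a few small helpers. This is a minimal sketch that only produces query strings in the forms shown in this section; the helper names (field, combine, length_range) are made up for illustration, and nothing here contacts NCBI.

```python
def field(term, code):
    """Attach a field code, quoting multiword terms."""
    if " " in term and not term.startswith('"'):
        term = f'"{term}"'
    return f"{term}[{code}]"

def combine(op, *clauses):
    """Join clauses with a capitalized logical operator inside parentheses."""
    op = op.upper()
    if op not in ("AND", "OR", "NOT"):
        raise ValueError(f"unknown operator: {op}")
    return "(" + f" {op} ".join(clauses) + ")"

def length_range(low, high):
    """Sequence length range in the form shown by the details display."""
    return f"({low} [Sequence Length] : {high} [Sequence Length])"

primates_not_human = combine("NOT", field("primate", "orgn"), field("human", "orgn"))
query = f"kinase AND {primates_not_human}"
print(query)
print(field("Escherichia coli", "orgn"))
print(f"#2 AND {length_range(396, 396)}")
```

The resulting strings can be pasted into the Entrez query box or, if they contain no set numbers, into the Blast "limit by Entrez query" box.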
Accession numbers and version numbers.
-
When working with accession numbers, the number appended after the decimal
is a version number. Generally you do not include the version number
when specifying the accession in an Entrez query. Entrez then returns
the most recent version. You may retrieve an earlier version if you
explicitly type the version number in the query box. In many cases,
the revision involves the documentation rather than the sequence.
Replaced versions are not present in the local bioinf NCBI database mirror,
and are not searched by the NCBI blast web page.
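When handling lists of accessions in your own scripts, it can help to separate the version suffix from the accession itself. A minimal sketch, assuming the version is whatever integer follows the last decimal point; the accession used below is made up.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    identifier = identifier.strip()
    base, dot, version = identifier.rpartition(".")
    if base and dot and version.isdigit():
        return base, int(version)
    # No version suffix: Entrez would return the most recent version.
    return identifier, None

print(split_accession("ABC12345.2"))   # made-up accession with a version
print(split_accession("ABC12345"))     # same accession, no version given
```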
Giving accession numbers or gi numbers to BLAST.
-
Blast at NCBI will accept the gi number or the accession number (including
replaced accessions) in the query box instead of actual protein sequence.
Other blast servers generally won't do that.
Redundancy and looking for more documentation about a particular sequence.
-
There is redundancy in the database derived from importing the same sequence
from multiple curated databases that each assigned their own accession
numbers and gi numbers. The level of documentation in each of the
curated databases could vary considerably. Typically, the Swiss-Prot
entry is the best documented.
-
A common way to find all the redundant entries is to do a Blast search
with the sequence and use the default pairwise alignment display.
The first hit will give links to all the mutually identical redundant entries.
-
Note: in the local bioinf mirror of the library you can quickly find redundant
entries for the protein database by grep <gi number> $BLAST_DB/nr,
where the gi number (or alternatively the accession) corresponds to one
entry you already know.
-
If there is a BLink tab in the NCBI Entrez display, that may have the information
on duplicate entries.
-
The records displayed by clicking the Swiss-Prot (and the PDB) links at
NCBI are not the real thing. They have been reformatted from the
records at those other databases, often with loss of information.
You have to copy the accession number, go to the other database's web site,
and paste it in their search box to get the real thing.
-
In mammalian species, and now extending into other species, NCBI is making
another database called RefSeq where they are trying to put one highly
curated cDNA entry per gene, one curated protein entry per gene, and one
DNA entry per genomic segment. Entrez searches can be limited to
RefSeq in the "only from" limit box. Also, RefSeq accession numbers
are the only ones with underscores in them, so they are easy to pick out
of a blast search result. If you plan to do much with the NCBI databases,
you should read their documentation for RefSeq
(http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html).
-
If you are really looking for a lot of documentation, try the Molecular
Information Agent (http://mia.sdsc.edu/). That site launches
a mass search against about 50 databases. An accession number for
Swiss-Prot or PDB is probably the best search key for it, but NCBI accession
numbers, keywords, or even sequences are accepted as search keys.
Also, they have an excellent list of web sites for relevant databases.
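The line that grep returns from the local nr file lists all the redundant entries together. The sketch below pulls them apart, assuming the definition line uses the gi|number|database|accession| layout with the merged entries separated by Ctrl-A characters; check that against your own copy of the file. It also flags RefSeq entries by the underscore rule mentioned above. The sample line and its identifiers are made up.

```python
def parse_nr_defline(defline):
    """Split a merged nr definition line into its redundant entries."""
    entries = []
    for part in defline.lstrip(">").split("\x01"):
        fields = part.split("|")
        if len(fields) < 4 or fields[0] != "gi" or not fields[1].isdigit():
            continue  # not in the assumed gi|number|db|accession form
        accession = fields[3]
        entries.append({
            "gi": int(fields[1]),
            "db": fields[2],
            "accession": accession,
            # RefSeq accessions are the ones containing an underscore.
            "refseq": "_" in accession,
        })
    return entries

# Made-up example line; the identifiers are illustrative only.
sample = (">gi|111|ref|NP_000001.1| example protein [Homo sapiens]"
          "\x01gi|222|sp|P00001|EXAM_HUMAN Example protein")
for entry in parse_nr_defline(sample):
    print(entry)
```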
Saving a search to repeat at a later time
-
To set up to repeat an Entrez search at a future time do this:
-
Do the search now.
-
Click <details> (at the bottom of the page).
-
Click <URL>
-
Bookmark the page that results in your browser.
-
At a future time, if you activate your bookmark, the search will be repeated.
-
For example, after finishing some search, you could impose a limit for
entries after today's date. Then go through the above process to bookmark
that search (which of course finds nothing). At a later time, when
you click your bookmark you will retrieve all entries made or modified
after this date. Unfortunately the name saved for the bookmark is
somewhat generic. If you use Netscape, the way to rename a bookmark
is to click <bookmarks>, <edit bookmarks>, right click the specific
bookmark, and then edit the name.
-
The search query saved in the bookmark has to be free of references to
other searches. So instead of #2 AND (396 [Sequence Length] : 396
[Sequence Length]) use
-
("maltose binding protein" AND (396 [Sequence Length] : 396 [Sequence Length]))
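Replacing set numbers with the searches they stand for can be done by hand, or with a small helper like the one sketched here. The history dictionary is something you would fill in yourself from the "history" menu; the function name expand_history is hypothetical, and nested references (a saved search that itself mentions another set number) are not handled.

```python
import re

def expand_history(query, history):
    """Replace #N references with the stored query text for search N."""
    def substitute(match):
        number = int(match.group(1))
        if number not in history:
            raise KeyError(f"no saved search #{number}")
        # Parentheses keep the substituted search together as one unit.
        return f"({history[number]})"
    return re.sub(r"#([0-9]+)", substitute, query)

history = {2: '"maltose binding protein"'}
query = "#2 AND (396 [Sequence Length] : 396 [Sequence Length])"
print(expand_history(query, history))
```

The expanded query no longer depends on the session history, so it is safe to save in a bookmark.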
Using Entrez queries to clean up a Blast alignment
-
You can force Blast to process just a list of finds from Entrez and align
them.
-
When you set up the Blast search, use one of the accession numbers as the
query.
-
In the advanced options section, put the Entrez search query where indicated.
-
This would usually be a list of gi numbers separated by spaces or a list
of accession numbers separated by OR. A mixed list does not work.
-
Ask for alignments to be displayed flat with identities.
-
You probably want to uncheck the box causing display of gi numbers.
In the alignment itself, this causes the more recognizable accession numbers
to be displayed instead of the gi numbers.
-
Note: a related procedure available at the NCBI page but not elsewhere
is to first do a Blast (or PsiBlast) search without limits, then redisplay
the result with alignment specified and a list of the entries you want
to see in the alignment filled in at the bottom of the query display form.
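The rule for the Entrez query used to limit Blast (gi numbers separated by spaces, or accession numbers separated by OR, but never a mixture) can be captured in a short helper. A minimal sketch; blast_limit_query is a made-up name and the identifiers in the example are illustrative.

```python
def blast_limit_query(identifiers):
    """Build a "limit by Entrez query" string from gi numbers or accessions."""
    ids = [str(i).strip() for i in identifiers if str(i).strip()]
    if not ids:
        raise ValueError("empty identifier list")
    numeric = [i.isdigit() for i in ids]
    if all(numeric):
        return " ".join(ids)       # gi numbers: separated by spaces
    if not any(numeric):
        return " OR ".join(ids)    # accessions: separated by OR
    raise ValueError("mixed gi numbers and accessions do not work")

print(blast_limit_query([111, 222, 333]))
print(blast_limit_query(["P00001", "NP_000001"]))
```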
Precomputed Blast results:
-
If you are tempted to just do a Basic Blast search with some sequence entry
as the key, first look and see if there is a "BLink" link or "domains"
button in the list of links at the top right of the entry display in Entrez.
This will give the results of a precomputed Blast search or CDD search,
with a nice graphics display and the option to do a taxonomic sort.
-
There may, however, be more to find if you do the search yourself.
In particular, you may find deeper relationships by using psiblast for
protein sequences.
-
Clicking on "Links" if present will give a popup menu of additional linked
resources. I notice that external resources once present have a way
of disappearing from the Links menu. There may be considerably more
information at other specialized databases. Try Google or BioHunt
at ExPASy if there seems to be a genome project surrounding the sequences
you are seeking.
Carving small pieces of genomic sequences from large nucleotide entries:
-
You can anticipate the coordinates of a section you want either because
it showed up in a blast search, or because you first retrieved a protein
entry which will list the corresponding coordinates within a genomic entry.
-
Within Entrez, retrieve the genomic entry by its accession number.
If it is sluggish, you can stop the download as soon as the top banner
displays. Then click the "get subsequences" button. Fill in
the coordinates and a new entry will display with the sequence and features
renumbered to make the first base retrieved number 1.
-
Alternatively from the local bioinf libraries:
-
fastacmd -s <gi number> -d $BLAST_DB/euk_genome -L x,y -o <output filename>
retrieves coordinates x to y from the human chromosome entry
with the indicated gi number.
-
fastacmd -s <gi number> -d $BLAST_DB/prok_genome -L x,y -o <output filename>
retrieves coordinates x to y from the prokaryote genome entry
with the indicated gi number.
-
If in doubt about the appropriate database name, try
fastacmd -s <gi number> -pF -d $BLAST_DB/nr -L x,y -o <output filename>
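If you already have the full sequence in hand, the same carving and renumbering can be done in a few lines. This sketch assumes coordinates are 1-based and inclusive, and keeps only features lying entirely inside the requested range; the function name carve, the toy sequence, and the feature names are all made up.

```python
def carve(sequence, start, end, features=()):
    """Return bases start..end (1-based, inclusive) and renumbered features.

    `features` is an iterable of (name, feature_start, feature_end) in the
    coordinates of the full entry; only features lying entirely inside the
    requested range are kept.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates outside the sequence")
    piece = sequence[start - 1:end]
    offset = start - 1   # subtracting this makes base `start` number 1
    kept = [(name, fs - offset, fe - offset)
            for name, fs, fe in features
            if start <= fs and fe <= end]
    return piece, kept

genome = "AAAACCCCGGGGTTTT"            # toy sequence, not real data
piece, feats = carve(genome, 5, 12, [("geneX", 7, 10), ("geneY", 1, 3)])
print(piece)   # CCCCGGGG
print(feats)   # [('geneX', 3, 6)]
```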
Downloading various formats from Entrez:
-
Notice that the display list box allows you to download representations
other than the flat GenBank form of the sequences found by Entrez.
Most useful are fasta and gilist. The fasta option downloads the
entire list of hits as a single multi-fasta file, which is the lowest common
denominator for importing to other programs. The gilist option downloads
just the gi numbers, one per line. This file can be used in the local
system to extract the entries from the local fasta file (fastacmd -i
gilist -d $BLAST_DB/nr -o <output filename>) or directly to limit searches
by Blast or PsiBlast.
-
After downloading a set of sequences, or gi numbers, count them.
A common error is to only get the first displayed page full.
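Counting is easy to script. The sketch below counts header lines in multi-fasta text and non-blank lines in a gilist download, then compares both with the number of hits reported by Entrez. The data and the expected count are toy values; with real downloads you would read the text from your files.

```python
def count_fasta_records(text):
    """Count sequences in multi-fasta text by counting header lines."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))

def count_gi_lines(text):
    """Count non-blank lines in a gilist download (one gi number per line)."""
    return sum(1 for line in text.splitlines() if line.strip())

fasta_text = ">entry1\nMKT\nLLV\n>entry2\nMAA\n"   # toy data
gilist_text = "111\n222\n\n"                        # toy data
expected_hits = 3   # hypothetical count reported on the Entrez results page
for label, found in (("fasta", count_fasta_records(fasta_text)),
                     ("gilist", count_gi_lines(gilist_text))):
    if found != expected_hits:
        print(f"{label}: got {found} of {expected_hits}; download may be incomplete")
```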
Genome sequences are sometimes not completely assembled.
-
In working with genomic sequences, beware that a stretch of 100 n's means
a gap of unknown length, and that the fragments between the gaps are not
necessarily in the correct order or relative orientation.
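Before trusting coordinates in a draft genomic entry, you can scan it for such gaps. A minimal sketch that reports runs of at least 100 n's as 1-based inclusive coordinates; the threshold is a parameter and the sequence is a toy example.

```python
import re

def find_gaps(sequence, min_run=100):
    """Return (start, end) pairs, 1-based inclusive, for runs of n/N."""
    pattern = re.compile(r"[nN]{%d,}" % min_run)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]

toy = "ACGT" * 5 + "n" * 100 + "TTGA" * 3    # toy sequence, not real data
print(find_gaps(toy))   # [(21, 120)]
```

Any region you carve out that spans one of these intervals should be treated as two fragments of uncertain spacing and orientation.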
Last updated 09/21/2006; Steve Hardies