Databases available for BLAST searching at UTHSCSA
The Blast databases are updated nightly from NCBI. Divisions are
defined below.
With NetBlast, one can search only named divisions, although some of
those are subsets of others or unions of other divisions.
With the command-line BLAST programs, one can search the union of multiple
divisions with the syntax -d "div1 div2 ...".
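As a rough illustration of the -d syntax, the following Python sketch assembles an argument list for the legacy blastall program. The helper name build_blastall_args, the file names, and the chosen divisions are assumptions made for illustration; the sketch relies only on blastall's -p, -d, -i, and -o options, and it builds the command without running it.

```python
def build_blastall_args(program, divisions, query_file, output_file):
    """Return an argv list for blastall that searches the union of divisions."""
    if not divisions:
        raise ValueError("at least one division is required")
    # Multiple divisions go into ONE -d argument, separated by spaces.
    database_arg = " ".join(divisions)
    return [
        "blastall",
        "-p", program,       # e.g. blastn, blastp
        "-d", database_arg,  # "div1 div2 ..."
        "-i", query_file,    # hypothetical query file name
        "-o", output_file,   # hypothetical output file name
    ]


args = build_blastall_args(
    "blastn", ["est_human", "est_mouse"], "query.fasta", "hits.txt"
)
print(args)
```

Handing such a list to subprocess.run keeps the division names together as a single argument, which is what the quotes accomplish on a shell command line.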
Last updated 11/30/2004 - Steve Hardies
Changes may occur in the arrangement of the divisions based on pending
changes at NCBI.
Peptide Sequence Databases
nr
Described by NCBI as "All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF"
The protein nr database contains essentially every available protein entry.
The same sequence may be present with different gi numbers as a GenBank
entry, an EMBL entry, a SwissProt entry, etc. The "nonredundant"
aspect of the organization is that the actual sequence for redundant entries
is represented only once, and hence searched only once. If that sequence is
matched in a BLAST search, links to all the entries corresponding to it
are then given.
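The idea of storing each distinct sequence once while keeping links to every entry can be sketched in Python as follows. This is only an illustration of the concept, not a description of how NCBI builds nr; the function name collapse_identical and the identifiers and sequences in the example are made up.

```python
def collapse_identical(records):
    """Group (identifier, sequence) pairs so each distinct sequence appears once.

    Returns a list of (identifiers, sequence) tuples in first-seen order.
    """
    merged = {}
    for identifier, sequence in records:
        key = sequence.upper()
        # The sequence is stored once; every matching entry adds its identifier.
        merged.setdefault(key, []).append(identifier)
    return [(ids, seq) for seq, ids in merged.items()]


# Hypothetical entries: the first two carry the same sequence.
records = [
    ("gb|EXAMPLE1", "MKTAYIAKQR"),
    ("sp|EXAMPLE2", "MKTAYIAKQR"),
    ("emb|EXAMPLE3", "MSDNELLKQA"),
]
for ids, seq in collapse_identical(records):
    print(seq, "<-", ", ".join(ids))
```

A search over the collapsed list compares the query against two sequences rather than three, while a hit on the first one can still report both of its source entries.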
nr.redundant
Same as nr, but one entry per sequence even if the sequence is identical
to another entry. It is not obvious that there would be any reason
to burn the extra search time to search this library. Its main utility
may be for use with fastacmd to retrieve sequences with clean definition
lines.
month.aa
[No longer available; NCBI has dropped support for the month databases.
One can limit a search to entries from any time interval with the
"limit by Entrez query" option available in NetBlast and at NCBI,
but not with command-line BLAST.]
A rolling 30-day look back at new sequences released into nr.
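A 30-day window comparable to the old month databases could be written as an Entrez date-range term. The Python sketch below only builds the query text; the [MDAT] (modification date) field tag and the colon range syntax are assumptions about Entrez query syntax that should be checked against NCBI's Entrez help, and the function name entrez_date_limit is hypothetical.

```python
from datetime import date, timedelta


def entrez_date_limit(end, days=30, field="MDAT"):
    """Build an Entrez date-range term covering the `days` before `end`.

    The field tag and range syntax are assumptions, not taken from this page.
    """
    start = end - timedelta(days=days)
    return f"{start:%Y/%m/%d}:{end:%Y/%m/%d}[{field}]"


# Text to paste into the "limit by Entrez query" box.
print(entrez_date_limit(date(2004, 11, 30)))
```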
swissprot
The last major release of the SWISS-PROT protein sequence database (no
daily updates).
yeast.aa
A single curated set (NCBI's RefSeq set) of yeast (Saccharomyces cerevisiae)
protein sequences. Compared to searching nr with "Saccharomyces cerevisiae"[orgn]
in the Entrez query field, searching yeast.aa is faster, but will only
show links to RefSeq entries. The Entrez-limited nr search may find additional
yeast protein sequences that RefSeq curators chose not to include.
ecoli.aa
A single curated set (NCBI's RefSeq set) of E. coli K12 protein sequences. See
yeast.aa for implications relative to limiting by Entrez query.
drosoph.aa
A single curated set (NCBI's RefSeq set) of Drosophila melanogaster
protein sequences. See yeast.aa for implications relative to limiting by
Entrez query.
mito.aa
Mitochondrially encoded proteins.
pdbaa
Sequences derived from the three-dimensional structures in the Brookhaven
Protein Data Bank.
igSeqProt
Kabat's database of sequences of immunological interest
alu.a
Translations of select Alu repeats from REPBASE, suitable for masking Alu
repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov
(under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski,
Nature vol. 371, page 752 (1994).
pataa
Sequences submitted in support of patent applications.
env_nr
Translations from env.nt
env_nr.redundant
Same as env_nr, but one sequence per definition line.
Nucleotide Sequence Databases
nr
No longer supported. nr for nucleotides was a union of essentially
every division below. One can construct a search of any arbitrary
union of divisions using command line blast.
nt
nt contains all the spliced sequences inferred from genomic sequences.
human_genomic
23 entries representing the RefSeq version of the human chromosomes.
Gaps of undefined size are represented by strings of 100 N's.
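Because gaps of undefined size are written as strings of 100 N's, it can help to locate those placeholders in a retrieved chromosome sequence before interpreting coordinates near them. The Python sketch below is one way to do that under the assumption that a placeholder is a run of exactly 100 N's; the function name find_gap_runs and the toy sequence are made up for illustration.

```python
import re


def find_gap_runs(sequence, gap_length=100):
    """Return 0-based, half-open (start, end) spans of N runs of exactly gap_length."""
    runs = []
    for match in re.finditer(r"N+", sequence.upper()):
        # Shorter or longer N runs are ordinary ambiguous bases, not placeholders.
        if match.end() - match.start() == gap_length:
            runs.append((match.start(), match.end()))
    return runs


# Toy sequence: 20 bases, a 100-N placeholder, 12 bases, a short N run, 4 bases.
toy = "ACGT" * 5 + "N" * 100 + "GGCC" * 3 + "NNN" + "TTAA"
print(find_gap_runs(toy))
```

The true distance between sequences on either side of such a span is unknown, so positions across it should not be used to estimate spacing.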
other_genomic
Complete chromosomes or genomes from many bacteria, viruses, bacteriophages,
plasmids, yeast, Drosophila, Arabidopsis, and C. elegans. Notably
missing are mammals and zebrafish. See this link for an 11/30/2004
listing of the genomes and chromosomes in "other_genomic". For
partially finished genomes, fragmentary sequences can be found in the htgs
and wgs divisions below. However, there may be more organized "freezes"
or "builds" curated to remove redundancies and group sequences by their
respective chromosomes. Check the NCBI genome resources to see whether such
a dataset exists and whether a special BLAST page exists for searching it
there. On request, we can import such datasets for local access.
yeast.nt
The RefSeq entries for the Saccharomyces cerevisiae chromosomes.
ecoli.nt
The RefSeq entry for the E. coli chromosome.
drosoph.nt
The RefSeq entries for the Drosophila melanogaster chromosomes.
est
The union of the three est divisions below.
est_human
Expressed sequence tags from humans. There may be many entries for
each gene, but in contrast to nt, only sequences corresponding to cDNA
clones that actually exist are included.
est_mouse
Expressed sequence tags from mouse.
est_others
Expressed sequence tags from organisms other than human and mouse.
pdbnt
Nucleotide sequences corresponding to protein sequences in the Brookhaven
Protein Databank.
igSeqNt
Nucleotide sequences from Kabat's database of sequences of immunological
interest
vector
Sequences of plasmids, etc. used as cloning vectors.
sts
A separate collection of short genomic sequences entered with primer information
as Sequence Tagged Sites.
gss
Genome Survey Sequences, including single-pass genomic data, exon-trapped
sequences, and Alu PCR sequences.
htgs
A collection of partially assembled sequences from the genome centers.
These sequences are organized by BAC or other large-insert clone.
See notes under "other_genomic" for alternatives to searching htgs.
wgs
Another collection of partially assembled sequences from the genome centers.
These are contigs assembled directly from whole genome shotgun sequencing.
See notes under "other_genomic" for alternatives to searching wgs.
mito.nt
Mitochondrial sequences.
alu.n
Select Alu repeats from REPBASE, suitable for masking Alu repeats from
query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov
(under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski,
Nature vol. 371, page 752 (1994).
env.nt
DNA sequence obtained directly from the environment (i.e., from all
organisms mixed together).
month.nt, month.htgs, month.est_human, month.est_mouse, month.est_other, month.gss
[No longer available; NCBI has dropped support for the month databases.
One can limit a search to entries from any time period with the
"limit by Entrez query" option available in NetBlast and at the NCBI
site, but not with command-line BLAST.]
A rolling 30-day look back at sequences entered into the respective divisions.
Last updated 3/3/2005 - Steve Hardies