PSI-Blast.
PSI-Blast is the preferred
method for searching a protein database with a protein sequence as the
key. If used for only one round, it is identical to BlastP.
Its algorithm is designed to conduct further iterations of the search and
to extend the search to distantly related homologues.
PSI stands for Position Specific
Iterated. This search method makes use of a profile, which is a position-specific
accounting of what amino acid residues are found in a family of aligned
homologous proteins. PSI-Blast accepts a protein sequence as input
and first conducts a normal BlastP search to identify homologues in the
database. A profile is constructed from the spectrum of sequences
found in the initially identified homologues. This profile is used
as the search key to identify more distant relatives. The process
is then iterated, each time refining the profile based on inclusion of
the new members. Ideally, the process is expected to converge on
a unique set of genes. In practice, the search may at some point
begin to include proteins that are related by chance similarity.
The user must use judgement to recognize when proteins of known and unrelated
functions begin to appear in the list of finds.
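The iteration described above can be sketched as a loop that stops when the set of included sequences no longer changes. This is a minimal illustration in Python, not NCBI's implementation: `iterate_search`, `toy_search`, and the `NEIGHBORS` table are hypothetical names and invented data, and a real round would rebuild the profile and rescan the database.

```python
from typing import Callable, FrozenSet, List, Set


def iterate_search(
    query: str,
    search: Callable[[FrozenSet[str]], Set[str]],
    max_rounds: int = 10,
) -> List[Set[str]]:
    """Repeat a profile search until the included set stops changing.

    `search` is a hypothetical stand-in for one round: it takes the current
    profile members and returns every sequence scoring above the inclusion
    threshold.
    """
    members: Set[str] = {query}           # round 1 is keyed by the query alone
    history: List[Set[str]] = []
    for _ in range(max_rounds):
        found = search(frozenset(members)) | {query}  # the query is always kept
        history.append(found)
        if found == members:              # converged: nothing added or dropped
            break
        members = found
    return history


# Toy "database": which sequences each member can detect (invented data).
NEIGHBORS = {
    "query": {"close1", "close2"},
    "close1": {"medium1"},
    "close2": set(),
    "medium1": {"distant1"},
    "distant1": set(),
}


def toy_search(members: FrozenSet[str]) -> Set[str]:
    """Return the members plus everything any member can detect."""
    hits = set(members)
    for name in members:
        hits |= NEIGHBORS.get(name, set())
    return hits


rounds = iterate_search("query", toy_search)
for number, found in enumerate(rounds, start=1):
    print(number, sorted(found))
```

In this toy run the distant sequence is reached only through the intermediate ones, which is the behavior the iteration is meant to exploit. The loop cannot tell a true relative from a chance hit, so the judgement call described above still falls to the user.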
Access to Psi-Blast.
The program itself is downloadable as part of the NCBI blast suite and
can be installed anywhere, even on your PC. However, its main use
requires that it can access an up-to-date copy of a large protein database.
Hence, one would mainly access it at some site that maintains such a database.
The premier Psi-Blast
site is at NCBI. The NCBI Psi-Blast page is accessed with a general
web browser. The UTHSCSA bioinformatics core facility maintains
a mirror of NCBI's nr database and two interfaces to the Blast suite of
programs including Psi-Blast. One of these is a web interface similar
to the NCBI web interface. The other is submission of searches from
commands given in a Linux terminal window.
Pros and cons of using Psi-Blast in the different environments:
-
At NCBI:
-
Has the most features, and new features show up here first.
-
Has the most flexible formatting of results
-
Has the most flexible means for confining searches to subsets of the database.
-
Plentiful documentation, although some documents are out of date.
-
Turnaround from the site sometimes bogs down.
-
Interface requires user to reinitiate each iteration. This allows
some tinkering with the sequences that make up the PSSM for the next iteration.
It also makes working through multiple iterations somewhat tedious.
-
NetPsiBlast at UTHSCSA
-
Note that this is not the same as the NetBLAST command in the GCG package.
-
Accessible by a standard web
browser interface by individuals with a UTHSCSA bioinf
account.
-
Main benefit: Fast turnaround, because access is limited to UTHSCSA account
holders.
-
This is a slightly less full-featured Psi-Blast server than the one implemented
directly at NCBI. For an accounting of differences in capabilities
see the local
blast help file. This server can be found implemented at other
computing centers. When moving from one site to the next, pay attention
to which database is searched and how up-to-date it is.
-
UTHSCSA
databases are nightly updated mirrors of NCBI's databases.
-
Interactive interface similar to the one at NCBI, requiring resubmission of each
iteration.
-
Links to sequences in the output are back to Entrez at NCBI, not to the
local database.
-
Stand alone Blast suite at UTHSCSA
-
Accessed from a terminal window by the command blastpgp. For syntax
and options see the local
blast help file.
-
Searches the local UTHSCSA-maintained databases, so may be faster than
the NCBI web site at times. But it is slower than the local web interface.
-
Main benefit: has many options not available in the web interfaces, including
batch searches and automatic multi-iteration searches. See the local
blast help file for summary of uses.
-
GCG PSIBLAST at UTHSCSA
-
By default this searches the local GCG databases, which are out-of-date.
-
Use the -INfile2=$BLAST_DB/<database name> parameter on the command
line to direct it to search the daily updated databases instead.
See the list
of local blast databases for names of valid databases.
-
Has fewer capabilities than the Blast suite blastpgp program. Type
genhelp
PSIBLAST for summary of features.
-
Main benefit: uses GCG formatted search key.
-
GCG SeqWeb PSIBLAST at UTHSCSA - has been discontinued.
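For the stand-alone blastpgp route listed above, an automatic multi-iteration search can be scripted. The following Python sketch only assembles the command line. The flags shown (-i, -d, -j, -h, -o) are those of the legacy NCBI blastpgp program; the function name `blastpgp_command` and the file names are hypothetical, and the local blast help file should be checked because options can differ between versions.

```python
import shlex
from typing import List


def blastpgp_command(
    query_fasta: str,
    database: str,
    iterations: int = 3,
    inclusion_evalue: float = 0.001,
    output: str = "psiblast.out",
) -> List[str]:
    """Build (but do not run) a legacy blastpgp command line."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return [
        "blastpgp",
        "-i", query_fasta,             # query sequence file (FASTA)
        "-d", database,                # database to search
        "-j", str(iterations),         # maximum number of iterations
        "-h", str(inclusion_evalue),   # E-value threshold for inclusion in the PSSM
        "-o", output,                  # output file
    ]


# Hypothetical file name; "nr" is the database mirrored locally.
cmd = blastpgp_command("myprotein.fasta", "nr", iterations=5)
print(shlex.join(cmd))
```

To actually run the search, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where blastpgp and the database are installed.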
NCBI's Blast help pages:
Profiles.
Nearly all search programs for protein
sequences use a scoring matrix, which allows a partial match score
to be assigned for related residues. A profile is a series of position-specific
scoring matrices, with one matrix customized for each position
in the sequence. This allows matches at conserved positions to be
more highly valued than matches at positions that diverge freely.
It also allows the kinds of allowed matches to be customized for the environment
of the residue. For example, an alanine to arginine substitution
can be evaluated as favorable at surface positions, but be strongly
penalized at interior positions. Although it is possible to construct
profiles based on structural information or physical chemical principles,
PSI-Blast uses the more common method of computing the profile from the
spectrum of residues observed in each position in homologues that are already
identified. This method has the advantage of being computationally
straightforward and automatically applicable to any new sequence.
PSI-Blast profiles are called PSSMs (position-specific scoring matrices).
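To make the idea of one scoring column per position concrete, here is a simplified Python sketch that turns a small alignment into log-odds columns. It is not the PSI-Blast procedure: it assumes a uniform background frequency, a flat pseudocount, and no sequence weighting, and the function name `build_pssm` and the toy alignment are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return one log-odds column (in bits) per alignment position.

    Simplifying assumptions of this sketch: uniform background of 1/20,
    a flat pseudocount, no sequence weighting, gap characters ignored.
    """
    background = 1.0 / len(AMINO_ACIDS)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


# Invented toy alignment: position 0 is fully conserved, position 2 varies freely.
toy_alignment = ["GKAL", "GRSL", "GKTL", "GKDI"]
pssm = build_pssm(toy_alignment)
print(round(pssm[0]["G"], 2), round(pssm[2]["A"], 2))
```

A match to the conserved glycine at the first position scores higher than a match to any of the residues seen at the variable third position, which is the behavior described above.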
Reference:
Altschul,
S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller,
W. & Lipman, D.J. (1997) "Gapped BLAST and
PSI-BLAST: a new generation of protein database search programs." Nucleic
Acids Res. 25:3389-3402.
Summary of the algorithm.
-
The method of searching with a profile is to assign a weight at each position
depending on how frequently the observed residue appears in that position
in a set of aligned sequences that are believed to be valid representatives
of the family, and then to combine these weights into a total matching
score.
-
The aligned sequences are called a "training set".
-
A closely related and more general formal construct is the "Hidden Markov Model"
or HMM, which also models insertions and deletions position by position.
See Sean Eddy's description
of the program HMMer 2.1.1 (http://hmmer.wustl.edu/) for commentary
on the variety of HMM engines in use.
-
Some documentation has taken to calling the profile a "consensus".
This is confusing, since "consensus" has long been used to mean a single
sequence composed of the most common residue at each position in
a multiple alignment.
-
Although it would also be possible to use position-specific gap penalties,
PSI-blast uses the same gap penalty function as ordinary blast.
-
The position specific matrix used is a composite of a standard substitution
matrix and the information derived from the aligned sequences. Hence
one starts with a reasonable scoring matrix and biases it towards the residues
found in homologues. This prevents poor performance in the typical
case where the aligned sequences are too few to independently generate
a statistically valid position-specific matrix.
-
Sequences that are marginally included at one iteration can be tossed out
in a later iteration. However, the original key is always included.
The algorithm has a vulnerability to including a nonhomologue by chance
matching and then dragging in more sequences related to the errant one
such that the profile is taken over by an unrelated family. The biased
retention of the original key provides some resistance against that happening.
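Two points from the list above can be sketched in Python: biasing a standard substitution-matrix column towards the observed residues, and combining per-position weights into a total score. The published method blends residue frequencies before converting them to scores; this sketch blends scores directly to keep the arithmetic visible, so treat it as an illustration only. The names `blend_column` and `score_ungapped`, the `beta` weight, and all numbers are assumptions made for the example.

```python
from typing import Dict, List

Column = Dict[str, float]


def blend_column(prior: Column, observed: Column, n_sequences: int, beta: float = 10.0) -> Column:
    """Weighted average of prior (substitution-matrix) and observed scores.

    With few aligned sequences the prior dominates; with many, the observed
    data dominate. `beta` is the weight given to the prior.
    """
    alpha = max(n_sequences - 1, 0)   # weight given to the observed data
    return {
        aa: (alpha * observed[aa] + beta * prior[aa]) / (alpha + beta)
        for aa in prior
    }


def score_ungapped(pssm: List[Column], segment: str) -> float:
    """Sum the position-specific weights for an ungapped segment."""
    if len(segment) != len(pssm):
        raise ValueError("segment and profile must have the same length")
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Invented scores over a three-letter toy alphabet.
prior = {"K": 5.0, "R": 2.0, "L": -2.0}      # what a standard matrix would say
observed = {"K": 1.0, "R": 4.0, "L": -4.0}   # what the aligned homologues suggest

few = blend_column(prior, observed, n_sequences=2)
many = blend_column(prior, observed, n_sequences=101)
print(few["K"], many["K"], score_ungapped([few, many], "KR"))
```

With two aligned sequences the blended column stays close to the standard matrix; with a hundred it follows the observed residues, which is why a small training set does not wreck the search.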
Tutorial.
There is an excellent PSI-blast
tutorial given on the NCBI blast page showing how a sample sequence
was analyzed through PSI-blast. Pay particular attention to their
use of a reverse PSI-blast strategy to judge the meaningfulness of the
inclusion of a distant but well characterized protein. Essentially,
by starting PSI-blast with the distant protein, you allow the profile of
that family to build and then see if PSI-blast eventually includes your
protein or its close relatives. If the relationship is meaningful,
you would expect the patterns of conservation in the distant family to
parallel those in the close family, and hence the PSI-blast inclusion to
be reciprocal.
However, reciprocal inclusion
may also fail for meaningful relationships. Note the paper by Aravind,
L. and Koonin, E.V. [(1999) Gleaning non-trivial structural,
functional and evolutionary information about proteins by iterative database
searches. J. Mol. Biol, 287:1023-1040] cited on the reading list of the
BLAST tutorial page. It emphasizes how the ability of PSI-blast to
bridge to divergent families is dependent on there being a range of close
and medium related proteins in the database with which to build the profile.
Hence, if your protein sequence fails to hit much in the first iteration,
you could still ask if it belonged to a postulated distant family by keying
the PSI-blast search with a member of that family and seeing if your protein
became included.
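The bookkeeping for the reverse strategy is simple enough to script once the two hit lists are in hand. The Python sketch below assumes you have already collected the identifiers included by each search. The function name `reciprocal_inclusion` and all identifiers are hypothetical.

```python
from typing import Set, Tuple


def reciprocal_inclusion(
    forward_hits: Set[str],
    reverse_hits: Set[str],
    my_family: Set[str],
    distant_protein: str,
) -> Tuple[bool, bool]:
    """Report whether inclusion holds in each direction.

    forward_hits: sequences included when the search is keyed by your protein.
    reverse_hits: sequences included when keyed by the distant protein.
    my_family:    your protein plus its close relatives.
    """
    forward = distant_protein in forward_hits
    reverse = bool(my_family & reverse_hits)
    return forward, reverse


# Invented identifiers, for illustration only.
forward_hits = {"myprot", "rel1", "rel2", "distantX"}
reverse_hits = {"distantX", "distY", "rel2"}
result = reciprocal_inclusion(
    forward_hits, reverse_hits, {"myprot", "rel1", "rel2"}, "distantX"
)
print(result)
```

A result that holds in only one direction does not by itself refute the relationship, for the reasons given above.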
If your sequence is not in
GenBank, then there does not appear to be a simple way to include it other
than as the key. You can consider some closely related sequence that
is in GenBank as a surrogate. It is also possible to download PSI-blast
and the relevant part of the database and search on your own computer where
you can include arbitrary sequences.
Options.
Many options have been implemented
for governing PSI-blast searches. One of the most useful is to limit
the search to various subsets of GenBank. The statistics of recognizing
a distant homologue will improve if the database is substantially reduced
in size. However, this must be done in an unbiased way. It would
be fair to limit a search keyed by a bacterial gene to bacterial
genomes only. It would not be fair to collect the insignificant hits from
a larger search into a smaller sample and then repeat the search.
The options are explained fully on the Blast
Help Page. The entry fields on the search pages are thoroughly
linked to the help page.
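The effect of database size on significance can be illustrated with the standard bit-score form of the expect value, E = m * n * 2^(-S'), where m is the query length, n is the database length in residues, and S' is the bit score. The Python sketch below ignores edge-effect corrections, and the score and database sizes are invented for the example.

```python
def expect_value(bit_score: float, query_length: int, database_residues: int) -> float:
    """Expected number of chance hits with at least this bit score.

    Uses E = m * n * 2**(-S'), ignoring edge-effect corrections.
    """
    return query_length * database_residues * 2.0 ** (-bit_score)


# Same hit (40 bits, 300-residue query) against two invented database sizes.
full = expect_value(40.0, 300, 1_000_000_000)   # whole database
subset = expect_value(40.0, 300, 50_000_000)    # an unbiased subset, 20x smaller
print(full, subset)
```

The same alignment is 20 times more significant against the subset, because E scales linearly with the number of residues searched. That gain is only legitimate when the subset is chosen without reference to the hits, as cautioned above.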
Limit by Entrez Query.
You can put an Entrez query string
in the "limit by Entrez query" box to limit the search to some subset of
sequences. The function of the box has fluctuated over the years,
so you may have to do some trial and error.
-
Over some years the limit was dropped on subsequent Psi-Blast iterations.
In other years the limit was maintained in subsequent iterations.
-
Recently (10/14/2006) the convention that multiword identifiers should
be enclosed in quotes was altered. "Homo sapiens"[orgn] now returns
no sequences but Homo sapiens[orgn] returns human sequences. This
implies that other elements of the Entrez query language may not work as
expected, so controls would be advisable when using this feature.
This has been reported to NCBI, and presumably may change again.
The Entrez limit box on the format part of the Psi-Blast page and Entrez
itself continue to work according to the previous convention.
Related programs.
-
PHI-Blast
and regular expression searching.
-
BlastP - The first round of Psi-Blast is the same as BlastP. If one
isn't looking for distant homologues, then BlastP will suffice. In
addition to the above sites, BlastP is available at all major bioinformatics
sites. Some commercial DNA analysis packages contain Psi-Blast clients.
A Psi-Blast client submits the search over the internet to another site
(usually NCBI), and retrieves the results. Three such packages with
a site license at UTHSCSA are the Wisconsin GCG package, Lasergene
(DNASTAR), and Vector NTI. Vector NTI advocates a storage system for
Blast results. You can also download a BlastP client for virtually
any platform from NCBI.
-
Psi-Tblastn: This is a strategy to make a PSSM by Psi-Blast in the normal
way, and then use it to search a DNA sequence in all 6 frames. Psi-Tblastn
is implemented in the Standalone
Blast Suite on bioinf.
-
RPS-Blast - The key sequence is searched versus a library of premade PSSMs
representing known protein families. RPS-Blast can be accessed at
NCBI
CDD search page. An RPS-Blast search is automatically conducted
during Psi-Blast searching at NCBI if the CD-search box is checked
on the submission form. RPS-Blast can be run from the UTHSCSA
NetBlast page, and from the Linux command line (local
blast help file).
-
HMMER. HMMER searches use a strategy similar to RPS-Blast.
HMMER uses a Hidden Markov Model (HMM) rather than a PSSM to represent
the known protein families. HMMs are more efficient at dealing with
gaps in comparisons between distant homologues. The best place to
search a sequence against libraries of protein families is at the Sanger
Center Pfam site. See also the local
HMMER help file, and comparison
of different protein family alignment tools, for the capabilities of
the UTHSCSA HMMER installation.
-
SAM. For matching a sequence to the greatest range of superfamily
members, use SAM. A sequence can be extensively analyzed to the level
of fold recognition at the UCSC SAM
site. See also the local
SAM help file, for extended capabilities offered by the local installation
of the SAM suite.
-
PsiPred - Psipred is a secondary structure prediction system that first
aligns homologues using Psi-Blast, and then issues a consensus secondary
structure prediction. PsiPred predictions can be done at the Psipred
web site. PsiPred can also be run locally from bioinf.
Last updated 3/31/2003 - Steve Hardies