PSI-Blast.

        PSI-Blast  is the preferred method for searching a protein database with a protein sequence as the key.  If used for only one round, it is identical to BlastP.  Its algorithm is designed to conduct further iterations of the search and to extend the search to distantly related homologues.

        PSI stands for Position Specific Iterated.  This search method makes use of a profile, which is a position-specific accounting of what amino acid residues are found in a family of aligned homologous proteins.  PSI-Blast accepts a protein sequence as input and first conducts a normal BlastP search to identify homologues in the database.  A profile is constructed from the spectrum of sequences found in the initially identified homologues.  This profile is used as the search key to identify more distant relatives.  The process is then iterated, each time refining the profile based on inclusion of the new members.  Ideally, the process is expected to converge on a unique set of genes.  In practice, the search may at some point begin to include proteins that are related by chance similarity.  The user must use judgement to recognize when proteins of known and unrelated functions begin to appear in the list of finds.
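The iterate-until-convergence logic described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: `iterate_until_converged`, `toy_search` and the `TOY_NEIGHBOURS` table are hypothetical names invented for the sketch, and the toy search function stands in for a real profile search against a database.

```python
from typing import Callable, FrozenSet, Set, Tuple


def iterate_until_converged(
    key: str,
    search_with_profile: Callable[[FrozenSet[str]], Set[str]],
    max_rounds: int = 10,
) -> Tuple[Set[str], int]:
    """Repeat a profile search until the included set stops changing."""
    included: Set[str] = {key}  # before the first round, only the key is known
    for round_number in range(1, max_rounds + 1):
        hits = search_with_profile(frozenset(included))
        new_included = hits | {key}  # the original key is always retained
        if new_included == included:  # nothing gained or lost: converged
            return included, round_number
        included = new_included
    return included, max_rounds  # gave up before convergence


# Hypothetical stand-in for a database: which sequences each member can "reach".
TOY_NEIGHBOURS = {
    "key": {"close1", "close2"},
    "close1": {"mid1"},
    "close2": set(),
    "mid1": {"distant1"},
    "distant1": set(),
}


def toy_search(members: FrozenSet[str]) -> Set[str]:
    """Pretend profile search: returns members plus everything they reach."""
    found = set(members)
    for member in members:
        found |= TOY_NEIGHBOURS.get(member, set())
    return found


final_set, rounds = iterate_until_converged("key", toy_search)
print(sorted(final_set), rounds)
```

In the toy data each round reaches one step further from the key, which mirrors how a growing profile can bridge to more distant relatives.  The `max_rounds` cap reflects the practical point that a real search may never settle on a unique set.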
 

Access to Psi-Blast.

The program itself is downloadable as part of the NCBI blast suite and can be installed anywhere, even on your PC.  However, its main use requires that it can access an up-to-date copy of a large protein database.  Hence, one would mainly access it at some site that maintains such a database.  The premier Psi-Blast site is at NCBI.  The NCBI Psi-Blast page is accessed with a general web browser.   The UTHSCSA bioinformatics core facility maintains a mirror of NCBI's nr database and two interfaces to the Blast suite of programs including Psi-Blast.  One of these is a web interface similar to the NCBI web interface.  The other is submission of searches from commands given in a Linux terminal window.
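For the Linux terminal route, a search might be launched roughly as in the sketch below.  It assumes the `psiblast` executable from the later NCBI BLAST+ suite, which is not named on this page, and a locally formatted database called `nr`.  The file names and the helper `build_psiblast_command` are hypothetical.  The script only assembles and prints the command; it does not run it.

```python
import shlex
from typing import List


def build_psiblast_command(
    query_fasta: str,
    database: str,
    iterations: int = 3,
    inclusion_evalue: float = 0.002,
) -> List[str]:
    """Assemble an argument list for a BLAST+ psiblast run (assumed installed)."""
    return [
        "psiblast",
        "-query", query_fasta,                        # protein sequence used as the key
        "-db", database,                              # formatted protein database
        "-num_iterations", str(iterations),           # number of search rounds
        "-inclusion_ethresh", str(inclusion_evalue),  # E-value cutoff for joining the profile
        "-out", "psiblast_report.txt",                # hypothetical output file name
        "-out_ascii_pssm", "final_profile.pssm",      # hypothetical file for the final profile
    ]


command = build_psiblast_command("my_protein.fasta", "nr")
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be handed to `subprocess.run(command, check=True)` on a machine where the program and database are available.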

Pros and cons of using Psi-Blast in the different environments:

NCBI's Blast help pages:

Profiles.

        Nearly all search programs for protein sequences use a scoring matrix, which allows a partial match score to be assigned for related residues.  A profile is a series of position-specific scoring matrices, with one matrix customized for each position in the sequence.  This allows matches at conserved positions to be valued more highly than matches at positions that diverge freely.  It also allows the kinds of allowed matches to be customized for the environment of the residue.  For example, an alanine to arginine substitution can be evaluated as favorable at a surface position but be strongly penalized at an interior position.  Although it is possible to construct profiles based on structural information or physical chemical principles, PSI-Blast uses the more common method of computing the profile from the spectrum of residues observed at each position in homologues that are already identified.  This method has the advantage of being computationally straightforward and automatically applicable to any new sequence.  PSI-Blast profiles are called PSSMs (position-specific scoring matrices).
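As a rough illustration of computing a profile from observed residues, the Python sketch below builds a log-odds table from a tiny ungapped alignment.  It is not PSI-Blast's actual weighting scheme.  The uniform background frequency, the simple additive pseudocount, the made-up sequences and the function names `build_pssm` and `score` are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: every residue equally common


def build_pssm(aligned: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return one log-odds column per alignment position (ungapped input)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    total = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        column = {}
        for residue in ALPHABET:
            freq = (counts[residue] + pseudocount) / total
            column[residue] = math.log2(freq / BACKGROUND)
        pssm.append(column)
    return pssm


def score(pssm: List[Dict[str, float]], sequence: str) -> float:
    """Sum the position-specific weights for an ungapped candidate."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


# Hypothetical aligned homologues: positions 1-2 conserved, the rest variable.
alignment = ["GKSAL", "GKTAV", "GKSGI", "GKTSL"]
pssm = build_pssm(alignment)
print(round(pssm[0]["G"], 2), round(pssm[4]["L"], 2))
print(score(pssm, "GKSAL") > score(pssm, "WWWWW"))
```

A match to the invariant glycine in the first column earns more than a match to leucine in the variable last column, which is the sense in which conserved positions are valued more highly.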

Reference:

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and
PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.

Summary of the algorithm.

  1. The method of searching with a profile is to assign a weight at each position depending on how frequently the observed residue appears in that position in a set of aligned sequences that are believed to be valid representatives of the family, and then to combine these weights into a total matching score.
    1. The aligned sequences are called a "training set".
    2. The closely related formal mathematical construct is a "Hidden Markov Model" or HMM; a profile HMM additionally models insertions and deletions at each position.  See Sean Eddy's description of the program HMMer 2.1.1 (http://hmmer.wustl.edu/) for commentary on the variety of HMM engines in use.
    3. Some documentation has taken to calling the profile a "consensus".  This is confusing, since "consensus" has long been used to mean a single sequence composed of  the most common residue at each position in a multiple alignment.
  2. Whereas it would also be possible to use position-specific gap penalties, PSI-Blast uses the same gap penalty function as ordinary Blast.
  3. The position specific matrix used is a composite of a standard substitution matrix and the information derived from the aligned sequence.  Hence one starts with a reasonable scoring matrix and biases it towards the residues found in homologues.  This prevents poor performance in the typical case where the aligned sequences are too few to independently generate a statistically valid position-specific matrix.
  4. Sequences that are marginally included at one iteration can be tossed out in a later iteration.  However, the original key is always included.  The algorithm has a vulnerability to including a nonhomologue by chance matching and then dragging in more sequences related to the errant one such that the profile is taken over by an unrelated family.  The biased retention of the original key provides some resistance against that happening.

Tutorial.

        There is an excellent PSI-Blast tutorial on the NCBI Blast page showing how a sample sequence was analyzed through PSI-Blast.  Pay particular attention to its use of a reverse PSI-Blast strategy to judge whether the inclusion of a distant but well characterized protein is meaningful.  Essentially, by starting PSI-Blast with the distant protein, you allow the profile of that family to build and then see if PSI-Blast eventually includes your protein or its close relatives.  If the relationship is meaningful, you would expect the patterns of conservation in the distant family to parallel those in the close family, and hence the PSI-Blast inclusion to be reciprocal.

        However, reciprocal inclusion may also fail for meaningful relationships.  Note the paper by Aravind, L. and Koonin, E.V.  [(1999) Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J. Mol. Biol. 287:1023-1040] cited on the reading list of the BLAST tutorial page.  It emphasizes how the ability of PSI-Blast to bridge to divergent families depends on there being a range of closely and moderately related proteins in the database with which to build the profile.  Hence, if your protein sequence fails to hit much in the first iteration, you could still ask whether it belongs to a postulated distant family by keying the PSI-Blast search with a member of that family and seeing if your protein becomes included.
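Once the hit lists from the two searches are in hand, the reciprocal check described above reduces to a pair of set intersections.  The sketch below assumes the hits have already been parsed into collections of sequence identifiers.  The function name and every identifier in it are hypothetical.

```python
from typing import Iterable, Tuple


def reciprocal_inclusion(
    hits_from_query: Iterable[str],
    hits_from_distant: Iterable[str],
    query_family: Iterable[str],
    distant_family: Iterable[str],
) -> Tuple[bool, bool]:
    """Return (forward, reverse) inclusion flags for two PSI-Blast runs.

    forward: the search keyed with your protein pulled in the distant family.
    reverse: the search keyed with the distant protein pulled in your family.
    """
    forward = bool(set(hits_from_query) & set(distant_family))
    reverse = bool(set(hits_from_distant) & set(query_family))
    return forward, reverse


# Hypothetical identifiers, for illustration only.
my_family = {"myProt", "myProt_homolog1"}
far_family = {"farProt", "farProt_homolog1"}

forward, reverse = reciprocal_inclusion(
    hits_from_query={"myProt", "myProt_homolog1", "farProt"},
    hits_from_distant={"farProt", "farProt_homolog1", "myProt_homolog1"},
    query_family=my_family,
    distant_family=far_family,
)
print(forward, reverse)
```

Both flags being true supports the relationship.  As the Aravind and Koonin paper cautions, a false flag in one direction does not by itself rule the relationship out, since it may only mean that the database lacks intermediate sequences.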

        If your sequence is not in GenBank, then there does not appear to be a simple way to include it other than as the key.  You can consider some closely related sequence that is in GenBank as a surrogate.  It is also possible to download PSI-blast and the relevant part of the database and search on your own computer where you can include arbitrary sequences.
 

Options.

        Many options have been implemented for governing PSI-Blast searches.  One of the most useful is to limit the search to various subsets of GenBank.  The statistics of recognizing a distant homologue will improve if the database is substantially reduced in size.  However, this must be done in an unbiased way.  It would be fair to limit the search of a bacterial gene against only bacterial genomes.  It would not be fair to collect the insignificant hits from a larger search into a smaller sample and then repeat the search.  The options are explained fully on the Blast Help Page.  The entry fields on the search pages are thoroughly linked to the help page.
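The effect of database size on the statistics can be made concrete with the standard relation between bit score and expected chance hits, E = m * n * 2^(-S'), where m is the query length, n is the total database length and S' is the bit score.  The sketch below ignores the effective-length corrections that Blast applies, and the database sizes are hypothetical round numbers.

```python
def expected_chance_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """E-value from a bit score, without effective-length corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical sizes: a full database and a tenfold smaller unbiased subset.
full_db = 1_000_000_000
subset_db = 100_000_000

e_full = expected_chance_hits(40.0, 300, full_db)
e_subset = expected_chance_hits(40.0, 300, subset_db)
print(e_full, e_subset, e_full / e_subset)
```

The same alignment score becomes ten times more significant in the tenfold smaller database.  That gain is only legitimate when the subset was chosen without reference to the hits, which is why re-searching a collection of marginal hits is not fair.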
 

Limit by Entrez Query.

       You can put an Entrez query string in the "limit by Entrez query" box to limit the search to some subset of sequences.  The function of the box has fluctuated over the years, so you may have to do some trial and error.
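A query string of the kind that goes in the box can be composed as in the sketch below.  The `[Organism]` field tag is standard Entrez syntax.  The helper names are hypothetical, and the command-line form is an assumption based on the BLAST+ `psiblast` program, where `-entrez_query` is accepted only together with `-remote`.

```python
from typing import List, Optional


def entrez_limit(include_organism: str, exclude_organism: Optional[str] = None) -> str:
    """Build an Entrez query restricting hits by organism."""
    query = f'"{include_organism}"[Organism]'
    if exclude_organism:
        query += f' NOT "{exclude_organism}"[Organism]'
    return query


def remote_psiblast_args(query_fasta: str, entrez_query: str) -> List[str]:
    """Argument list for a remote search limited by an Entrez query (assumed BLAST+ flags)."""
    return [
        "psiblast",
        "-query", query_fasta,          # hypothetical input file
        "-db", "nr",
        "-remote",                      # run the search at NCBI
        "-entrez_query", entrez_query,  # same string as typed in the web box
    ]


bacteria_only = entrez_limit("Bacteria")
no_ecoli = entrez_limit("Bacteria", exclude_organism="Escherichia coli")
print(bacteria_only)
print(no_ecoli)
print(remote_psiblast_args("my_protein.fasta", no_ecoli))
```

Because the behavior of the box has changed over time, it is worth checking that the number of database sequences reported in the search output matches what the query should select.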

Related programs.



Last updated 3/31/2003 - Steve Hardies