Comparison among different programming systems for organizing protein families.
Scope: Psi-Blast, ClustalW, HMMER, SAM
Caveat: Each system is actively upgraded, often with the motive
to incorporate features derived from the others. Hence, a capability
cited as lacking in some program in the comments below may appear in a
later version. Links to pages describing the individual programs
should be followed to find the latest innovations in each system.
Psi-Blast
Psi-Blast is generally used to search the entire protein database for distant
homologues. It is a multi-iteration search tool, for which the first
round is identical to BlastP. In each additional iteration, it composes
a Position Specific Scoring Matrix (PSSM) from the sequences already found,
and uses that as the key in the subsequent iteration. The PSSM has
a different matching matrix for each position in the alignment of sequences
already found. The matrix for each position is based on the general
substitution matrix (usually BLOSUM62), upweighted for residues already
found at that position in the alignment. The scheme is very similar
to that used in ClustalW. It follows from the development of profile
methods, and is a forerunner to the Hidden Markov Models (HMM) used by
HMMER and SAM. It differs from the HMM methods in that gaps are placed
according to the BlastP algorithm (a fixed penalty plus a penalty for extension),
rather than learning position specific gap penalties from the sequences
already aligned.
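As a rough illustration of how a position specific matrix can be derived from the sequences already found, the Python sketch below builds log-odds scores from the residue counts in each alignment column. It assumes a uniform background and a flat pseudocount instead of the substitution-matrix upweighting described above. The names build_pssm and score_ungapped are hypothetical, and this is not the actual Psi-Blast weighting scheme.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    Assumption: uniform background frequencies and a flat pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        # Gap characters and unknown symbols are simply ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_ungapped(pssm, segment):
    """Score a segment against the PSSM position by position, without gaps."""
    return sum(column[res] for column, res in zip(pssm, segment))
```

Residues seen often in a column score above zero at that position and unseen residues score below zero, so a segment resembling the family outscores an unrelated one.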
-
Implementation:
-
At NCBI, and other major bioinformatics institutes. Interactive interface
only.
-
Local version at UTHSCSA. This is an implementation of NCBI's downloadable
Standalone Blast Suite. It provides some additional flexibility of
use. Interactive interface or batch processing.
-
Downloadable versions for PCs are found at NCBI. Downloading and
installing a local copy of the Blast suite is easy. Downloading and
maintaining an up-to-date copy of the databases is not so easy.
-
Characteristics:
-
Always does submodel vs. subsequence scoring. This means that alignments
for differing domains with different sets of homologues will develop during
the same iterative search. However, you cannot automatically constrain
it to add only sequences that match across all domains. You can do
that manually from the interactive interface.
-
Statistical evaluation of significant matches is empirically based.
-
Statistics are automatically adjusted to database size, and the database
can be restricted in various ways. This can give greatly increased
sensitivity if the database restriction is based on valid outside knowledge,
or a horrible artifact if the database restriction is arbitrary.
The database size for computing E values can be overridden by a user entry.
-
The ability to extend to highly divergent sequences depends on the availability
of intermediate sequences with which to form the PSSM.
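The effect of database size on the statistics can be sketched in Python. Under the usual Karlin-Altschul form, E = K m n exp(-lambda S), the E value is proportional to the size n of the search space, so a reported E value can be rescaled linearly to a different database size. The function name rescale_evalue is hypothetical, and the sketch ignores the length and edge-effect corrections that real Blast applies.

```python
def rescale_evalue(evalue, reported_db_size, effective_db_size):
    """Rescale an E value from one database size to another.

    Assumption: E is simply proportional to database size, which holds
    for E = K*m*n*exp(-lambda*S) when only n changes.
    """
    if reported_db_size <= 0 or effective_db_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * effective_db_size / reported_db_size
```

For example, a hit with E = 0.5 in a database of 1,000,000 residues would be assigned E = 0.005 if the search were justifiably restricted to 10,000 residues. The same arithmetic shows how an arbitrary restriction can manufacture apparent significance.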
-
Major advantages:
-
Speed. Because of the simple gap penalty form, Psi-Blast is fast
enough to search the entire protein database on each iteration. Hence
it is an excellent choice for the front end of an analysis wherein the
found sequences will be subjected to a more intensive algorithm.
-
The NCBI interface gives a particularly good graphic representation for
evaluating domain structure of a sequence.
-
Will often find extensively divergent homologues of 15% or less identity.
-
Major disadvantages:
-
Simple gap function gives relatively poor alignments. Alignments
have to be worked over retrospectively by another system. Output
of aligned sequences is not in a standard format for input to other programs,
hence porting the found information can generate a lot of fiddling.
The poor alignments involving divergent sequences may be somewhat offset
by manually increasing the gap opening penalty while reducing the gap
extension penalty.
-
Validity of the statistical evaluation becomes increasingly questionable
for increasing numbers of sequences in the PSSM. The user has to
use subjective criteria to decide when to cut off the search, or
it may branch into another family related only by chance similarity.
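The simple gap function mentioned above (a fixed penalty plus a penalty for extension) can be sketched in Python as follows. The sketch assumes the convention that a gap of length k costs open + k * extend. Programs differ on whether the first gapped position is also charged the extension penalty, and the function names are hypothetical.

```python
def affine_gap_cost(length, open_penalty, extend_penalty):
    """Cost of one gap under an affine scheme: open + length * extend."""
    if length <= 0:
        return 0
    return open_penalty + extend_penalty * length


def total_gap_cost(gap_lengths, open_penalty, extend_penalty):
    """Summed cost of several independent gaps in one alignment."""
    return sum(
        affine_gap_cost(n, open_penalty, extend_penalty) for n in gap_lengths
    )
```

With open = 11 and extend = 1, three single-residue gaps cost 36 and one three-residue gap costs 14. Raising the opening penalty to 15 and lowering the extension penalty to 0.5 gives 46.5 and 16.5, so the change favors a few long gaps over many scattered ones.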
-
Variations:
-
Reverse Psi-Blast:
-
An ad hoc reverse Psi-Blast control can be performed by starting
a Psi-Blast search from a proposed distant homologue and seeing if it finds
its way back to the initial key sequence. If the reverse search finds
lots of variously related sequences but fails to find the initial key,
that would tend to reject a true relationship. If the reverse search
cannot extend far because of a lack of intermediate sequences, then it
provides no information about the proposed relationship.
-
RPS-Blast. A library of PSSM models describing various protein families
is maintained at NCBI. The CD search facility searches a sequence
against the library of models using the Psi-Blast algorithm. The
same capability is available locally through the RPS-Blast implementation.
PSSM models for this purpose are generally developed from alignments generated
by programs better at placing gaps than is Psi-Blast. Most of NCBI's
models are derived from HMMER HMMs in Pfam.
-
Starting a forward search from a previously developed PSSM (or alignment).
-
NCBI's version is rigged to allow entry of a PSSM derived from a prior
search (or otherwise), and continue expanding it through iterative searches
of the database.
-
The local version can accept an alignment to perform this function.
This is an excellent way to take a protein family alignment formed by some
program better at placing gaps and then continue searching for homologues
with the fast Psi-Blast algorithm.
-
PSI-TBLASTN (local only). The PSSM developed by Psi-Blast
is used to search a nucleotide database translated in all six frames.
This can use the sensitivity of Psi-Blast to find genes in genomes missed
by the original annotators.
-
Secondary structure prediction.
-
The Psi-Pred system, accessed either remotely or locally, compiles the
PSSM for a key sequence and uses that additional information to improve
secondary structure prediction.
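The six-frame translation that PSI-TBLASTN applies to a nucleotide database can be sketched in Python as below. This is a minimal illustration that assumes the standard genetic code and plain A/C/G/T input (anything else becomes X). The function names are hypothetical and this is not the Blast implementation.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the first base; unknown codons give X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

Each of the six protein strings can then be scored against the PSSM as if it were a database protein, which is how a gene missed by the annotators can still be detected.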
ClustalW
ClustalW is a profile-based progressive alignment system. That means
it first aligns the closest sequences to form a profile, representing the
different residues at each position. It subsequently aligns sequence
to profile and profile to profile using the information in the profile
to help align distant families. Clustal W is the most commonly used
algorithm after Psi-Blast, and the most commonly used algorithm to prepare
seed alignments for HMMER.
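The "closest sequences first" idea behind progressive alignment can be sketched in Python. The sketch below only decides the order in which sequences and groups would be merged. It does not align anything. It assumes difflib's similarity ratio as a stand-in for a real pairwise alignment score and simple average linkage in place of the guide tree that ClustalW actually builds. The names similarity and merge_order are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Crude pairwise similarity in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def merge_order(sequences):
    """Return the order in which groups would be merged, closest first.

    sequences: dict mapping name -> unaligned sequence.
    Each merge is a pair of name tuples (left_group, right_group).
    """
    clusters = [(name,) for name in sequences]
    merges = []

    def average_similarity(pair):
        left, right = pair
        scores = [
            similarity(sequences[x], sequences[y]) for x in left for y in right
        ]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=average_similarity)
        left, right = best
        merges.append((left, right))
        clusters = [c for c in clusters if c not in best] + [left + right]
    return merges
```

In a real progressive aligner, each merge step would be a sequence-to-profile or profile-to-profile alignment, with the most divergent groups joined last.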
Implementation:
-
Remote at EBI.
-
Local at UTHSCSA (local implementations have some additional flexibility).
-
One can download copies for PCs from EBI. The size of the alignment
that can be dealt with is limited by the memory capacity of the machine.
Clustal X is a version with a color graphics interface.
Characteristics:
-
Clustal W is not a database searching system. It is used to align
a set of sequences proposed to be homologues by another system.
-
Gap penalties are more sophisticated than those of Psi-Blast, with decreased penalties
where sequences in alignment already have a gap, and an automatic capacity
to avoid gaps in hydrophobic runs. However, they are still empirically
adjusted, and the user may expect to do a lot of subjective fiddling if
the alignment has very divergent sequences.
-
Similar to Psi-Blast, Clustal W can accept a subset of the alignment prealigned
by another system, and then carry on.
-
Can accept a secondary structure map, and adjust gap penalties to avoid
gapping within secondary structure elements.
-
Can accept an arbitrary set of position specific gap penalties from the
user.
Advantages:
-
Often implemented together with a Neighbor Joining phylogenetic analysis
package, including Bootstrap analysis. This provides an easy route
to a statistically defensible statement about the tree
relating the sequences.
-
Implemented within the Seaview multiple alignment editor, so that local regions
of a larger alignment can be revised by ClustalW.
-
Default parameters are widely accepted to produce an acceptable "objective"
alignment on sequences of >= 25% identity.
-
May do better than HMMER or SAM at relating divergent but well populated
families.
-
ClustalW will allow preserving prior alignments of two or more subsets
as the global alignment is formed. The other methods will at most
preserve the prior alignment of only one group.
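The resampling step behind the Bootstrap analysis mentioned above can be sketched in Python: columns of the alignment are drawn with replacement to make a pseudo-replicate of the same width, and a tree would then be rebuilt from each replicate. The sketch covers only the resampling. Tree building is left to the phylogenetic package, and the function name bootstrap_alignment is hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng=None):
    """Return one bootstrap replicate of an alignment.

    alignment: list of equal-length aligned sequences.
    Columns are sampled with replacement, so every row keeps its length
    and the pairing of residues within a column is preserved.
    """
    rng = rng or random.Random()
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[i] for i in picks) for row in alignment]
```

The fraction of replicates whose trees recover a given grouping is the bootstrap support for that grouping.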
Disadvantages:
-
Is easily confused by long stretches of unalignable sequences within otherwise
well related sequences.
-
Often produces a kind of false objectivity, where the user has fiddled
with the program parameters to achieve a subjectively pleasing result,
rather than just manually editing the alignment.
-
Has no statistical evaluatory properties. It will produce an alignment
whether the provided sequences are related or not.
-
Is worse than HMMER or SAM at adding a single divergent sequence to
a family.
Related programs:
-
Pileup in the GCG system is related to ClustalW, but performs substantially
less well on even moderately divergent sequences.
-
T-Coffee is a related system, which is considered by some to perform slightly
better (but more slowly) than ClustalW.
HMMER
HMMER creates an HMM of a protein family. The HMMER HMM contains
both position-specific residue information and position specific gap penalties.
To calculate these it uses a build cycle, wherein it iteratively computes
the gap penalties from the existing alignment, and then realigns the sequences
according to the position specific gap penalties. In principle, it
should be able to achieve its final alignment and model starting from unaligned
sequences. In practice the process is more likely to settle on a
good final model if the sequences are prealigned. This is most typically
done with ClustalW. HMMER is most typically used as follows:
-
The degree of divergence to be considered belonging to a particular protein
family is preconceived and a seed alignment is prepared by another method,
typically ClustalW.
-
The HMMER build program is used to make the HMM.
-
The HMM is used to search the global protein database, and additional homologues
are aligned with the original seed sequences.
-
Cutoffs are recommended to exclude sequences that exceed the original expert's
preconception about what should belong to the family.
-
Pfam and some other databases provide a library of such models, the original
alignments created from the build process (called seed alignments), and
the more global alignments. There is also usually a facility to search
a novel sequence against the library and suggest which families it may
join.
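The build and search steps above can be sketched as command lines assembled in Python. This assumes the HMMER 3 command-line layout (hmmbuild taking the output HMM file followed by the alignment file, hmmsearch taking the HMM file followed by the sequence database). Older HMMER releases differ, and every file name here is hypothetical. The helper only assembles the commands and does not run them.

```python
def hmmer_commands(seed_alignment, hmm_file, sequence_db, table_out, evalue):
    """Return argv lists for building an HMM and searching a database.

    Assumption: HMMER 3 style programs are on the PATH. The E value
    cutoff stands in for the family-specific cutoffs described above.
    """
    build = ["hmmbuild", hmm_file, seed_alignment]
    search = [
        "hmmsearch",
        "-E", str(evalue),        # report hits at or below this E value
        "--tblout", table_out,    # tabular per-sequence hit summary
        hmm_file,
        sequence_db,
    ]
    return [build, search]
```

Each list could be passed to subprocess.run(cmd, check=True) in turn. The seed alignment would typically come from ClustalW, converted to a format the build program accepts.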
Implementation.
-
For searching a novel sequence against a premade library, the best place
to do that is at the site hosting the library. For Pfam, that is
the Sanger Center. Since the HMMER library concept is tied to decisions
made by "experts" during assembly of the family, you will only be able
to retrieve the seed alignments representing the judgments of those experts
from a separate database at the hosting site.
-
A number of software packages allow importation of the Pfam library and
will run the search against it. But usually information about the
seed alignments is lost in these implementations. Also, Pfam updates
monthly, so it is easy to be tricked into searching an out-of-date version.
-
There is a local implementation at UTHSCSA that permits additional exploration
of the families. It should be used in conjunction with the remote
sites, not instead of them.
-
The HMMER programs implemented within GCG are essentially interchangeable
with the free standing ones, except for file format issues. The Pfam
database version internal to the GCG distribution suffers from falling
out of date.
Characteristics.
-
The program doesn't deal well with large separation between alignable regions.
Hence, the families analyzed end up corresponding to protein domains.
-
The nature of the program requires separate models for two purposes:
detecting homology to a portion of a domain in a sequence, and
evaluating homology across the entire domain. Therefore,
Pfam has two divisions to cover these two purposes.
-
The model building process involves a mixture of prior expectations and
the observed distribution of gaps and residues. There is some flexibility
for adjusting the prior expectations (called "priors" in the documentation).
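The mixture of prior expectations and observed counts can be sketched in Python as a pseudocount estimate: the prior is expressed as pseudocounts that are added to the observed counts before normalizing. This is a minimal sketch that assumes a single Dirichlet prior, whereas the actual packages offer more elaborate priors. The function name posterior_mean and the example numbers are hypothetical.

```python
def posterior_mean(observed_counts, prior_pseudocounts):
    """Combine observed counts with prior pseudocounts into probabilities.

    Both arguments map an outcome (a residue, or a transition such as
    'match' or 'insert') to a non-negative number. With few observations
    the prior dominates; with many, the data dominate.
    """
    outcomes = set(observed_counts) | set(prior_pseudocounts)
    combined = {
        o: observed_counts.get(o, 0) + prior_pseudocounts.get(o, 0)
        for o in outcomes
    }
    total = sum(combined.values())
    if total <= 0:
        raise ValueError("need at least one positive count or pseudocount")
    return {o: value / total for o, value in combined.items()}
```

For example, a column where 8 sequences continue as a match and 2 open an insertion, combined with a prior of 9 to 1, gives an insertion probability of 3/20 = 0.15. The raw data alone would give 0.20 and the prior alone 0.10.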
Advantages and Disadvantages.
-
Versus ClustalW as an alignment tool, HMMER provides a more objective placement
of gaps. However, ClustalW provides the user with more capability
to introduce secondary structural considerations.
-
HMMER can be used as a search tool for additional homologues, whereas ClustalW
cannot.
-
HMMER alignments can be converted to PSSMs and hence used by Psi-Blast.
This gives a much faster database search method, with the poor alignment
performance of Psi-Blast somewhat offset by the intelligent placement of
gaps by HMMER in the original alignment. This is the basis of NCBI's
RPS-Blast system (aka CD search).
-
In terms of the ease for distributing models of new families over the web
that others might use: PSSMs or the associated alignments are the easiest
for most others to grab and use, since if they are formatted right, they
can go straight into NCBI's web page for further searching. Many
packages have HMMER implementations, so many could use HMMER HMM models
directly. SAM implementations are relatively rare, so SAM models
would usually be converted to HMMER models for distribution.
-
Compared to SAM, HMMER could be used cyclically to expand to larger and
even more divergent families, but it is really not designed to do so.
SAM is. Compared to SAM, HMMER's selection of priors is limited,
and its interaction with secondary structure is rudimentary.
-
HMMER is relatively dependent on sequences being divided up into domains
in advance of model building.
-
HMMER results are difficult to divorce from "expert" opinion, since both
defining domains and defining the degree of divergence to be included in
a protein family are based on preconceptions rather than computation.
SAM
(Sequence Alignment and Modeling Tool).
SAM is very much like HMMER algorithmically. Its implementation has
a mode for searching against premade libraries like HMMER, but also a search
to compile a family starting from one sequence like Psi-Blast. The developers
of SAM have focused on fold recognition, and hence SAM has many facilities
related to searching at higher degrees of divergence than typically done
with HMMER.
Implementation
-
The SAM web site at UCSC will provide assembly of a family from a single
sequence, roughly like conducting four cycles of Psi-Blast. It will
provide consensus secondary structure prediction, and also scores reflecting
matching to a family of HMMs representing different protein folds.
-
The local implementation of SAM provides the target99 script which can
be used to expand a family through an additional arbitrary number of cycles,
define it based on arbitrary seed alignments, and perform other tricks
to achieve high sensitivity in incorporating divergent members. Secondary
structure prediction can be incorporated, but not in the automatic sense
it is at the UCSC site. The local implementation does not include
the fold library.
Characteristics
-
The SAM search process from a single sequence does four cycles of searching
and HMM building. As more divergent sequences are included, it automatically
switches to appropriate prior information about how gaps should be defined.
-
Its final alignment is made with some automatic sensitivity to avoiding
gaps in secondary structure elements.
-
If there is 3D information for the sequence, this information can be strongly
incorporated into the process.
-
SAM has a facility to mask portions of the starting sequence, thus focusing
the search and alignment process on one or more known conserved domains.
-
SAM has some facility to incorporate knowledge about secondary structure
in the searched sequences (as opposed to the key sequences), but this is
not well developed in the package at this time. It would require
conducting a prior secondary structure prediction over all sequences in
the searched database.
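The cycle of searching and model building described above can be sketched generically in Python. The model builder and the search are passed in as functions, because the real ones are supplied by the SAM programs. The name expand_family and its parameters are hypothetical, and the early stop on convergence is an assumption rather than documented SAM behavior.

```python
def expand_family(seed, build_model, search, cycles=4):
    """Iteratively grow a family: build a model, search, add the hits.

    seed: iterable of starting sequence identifiers.
    build_model: function taking the current members, returning a model.
    search: function taking a model, returning an iterable of hit ids.
    Stops after the given number of cycles or when nothing new is found.
    """
    members = set(seed)
    for _ in range(cycles):
        model = build_model(members)
        hits = set(search(model))
        if hits <= members:
            break  # converged: the search found no new sequences
        members |= hits
    return members
```

The same loop describes Psi-Blast iterations, with the PSSM as the model. In either system, the risk of drifting into an unrelated family grows with each added cycle.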
Advantages/ disadvantages
-
Compared to HMMER, SAM can more easily incorporate multiple domains within
the same family alignment. However, the developers still advise paring
down to individual domains if possible.
-
Compared to HMMER, SAM can more easily extend the family to more divergent
homologues. However, like Psi-Blast, the user has to be more on guard
against jumping over into an unrelated family.
-
Compared to HMMER, SAM is much better set up to start from a prior alignment
and carry on extending the family. It has flexibility about how or
whether to adjust the prior alignment in the process.
-
When adding sequences to a prior alignment, SAM is capable of using that information
to extend alignment beyond the domains that were originally aligned.
Psi-Blast does this to a limited extent. HMMER is bound to its domain models.
-
There seems to be little option for distributing a SAM HMM other than converting
it to a HMMER HMM, which is supported in the package. The HMMER HMM
should retain the degree of divergence recognized, and the structural sensitivity,
but may not be able to handle multiple domains in the same HMM.
-
SAM is very slow at searching compared to HMMER (which is already slow
compared to Psi-Blast), to the point that SAM cannot realistically search
the entire nr database. So it uses Blast as a prefilter to pick out
the set of sequences within E<300 from each sequence in the seed alignment.
The burden of Blast searching can become prohibitive if the seed alignment
is very large to start with. And further, a given divergent target
might never get searched for matching to the HMM. A reasonable work-around
is to first run many iterations of Psi-Blast at some low stringency.
Then compile a separate database using fastacmd of all of those hits.
Then make SAM search that database without the Blast prefilter.
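The step of compiling a separate database from the Psi-Blast hits can be sketched in Python for the case where the sequences are available as FASTA text. This is a stand-in for what fastacmd does against a formatted Blast database, not a description of fastacmd itself. The function name subset_fasta is hypothetical, and the identifier is assumed to be the first whitespace-delimited word of each header line.

```python
def subset_fasta(fasta_text, wanted_ids):
    """Return only the FASTA records whose identifier is in wanted_ids.

    Assumption: the identifier is the first word after '>' on the header
    line. Sequence lines are kept exactly as they appear in the input.
    """
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted_ids
        if keep:
            kept.append(line)
    return "".join(line + "\n" for line in kept)
```

The resulting text can be written to a file and used as the small database that SAM searches directly, with wanted_ids collected from the Psi-Blast hit list.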
Last updated 3/22/2003 - Steve Hardies