HMMER

Version installed is 2.3.2.  Program status.

Sources of documentation:

Summary of HMMER uses

HMMER is typically used to make HMM models of protein families exhibiting on the order of 20-25% identity.  The HMM models are then accumulated into a library, and protein sequences are subsequently searched against the library to assign them to families by homology.  The major international database for HMMER HMM models is Pfam.  The best place to search a new protein sequence against Pfam is at the Sanger Center web site.  In addition to Pfam, you can search two other extensive HMM libraries from this site.  One is called SMART, and the other is maintained by TIGR.  A variety of mirror sites exist, but they do not contain all of the functionality of the Sanger site.

The local HMMER implementation allows additional exploration of protein families.  HMM models can be extracted from Pfam and used to make alignments of family members or to seek more divergent family members.  HMM models for novel protein families can be created, or existing families can be expanded or subdivided.  For expanding to superfamilies, one should look into the Sequence Alignment and Modeling program suite (SAM).  SAM is much like HMMER, but has additional features that are particularly helpful for characterizing superfamilies.  Alternatively, one could define a HMMER model starting from an alignment made by any other trusted method.  Most typically, new HMMER models are made after a Blast search followed by a Clustal alignment.

Nature of the Pfam libraries:

An extensive library of HMMER models for various protein families is maintained in the Pfam database at the Sanger Center, and at several mirror sites including Pfam at Washington Univ., St. Louis.  The Pfam databases at those sites also provide two precomputed alignments for each family: one consists of the seed alignment used to create their model, and a second alignment including additional sequences detected by their model.  Pfam imposes fairly conservative limitations on inclusion of sequences in its alignments.  You may wish to redevelop the alignment with looser limits for inclusion of sequences, or by including sequences not present in the database they used.  This requires obtaining the HMM model corresponding to the family of interest.  UTHSCSA keeps local copies of the Pfam HMM libraries from which you can extract individual HMM models for further exploration.

Pfam uses expert annotators to divide protein sequences into domains and to decide how divergent sequences should be before they are considered not to belong to the family.  There are two Pfam libraries that can be searched, each containing a model for each domain represented in the database.  The path and name for the local versions of the libraries are:

$HMMERDB/Pfam_fs
$HMMERDB/Pfam_ls

Note that searching at the remote Pfam sites also requires deciding between these two modes.  The first mode (called local-local, and used by the Pfam_fs library) may be able to identify a fragment of a domain, for example in an EST or other fragment of sequence, or when the defined domain can be subdivided further by natural recombination.  The second mode (called global-local, or glocal, and used by the Pfam_ls library) may be more sensitive for finding divergent sequences containing the full domain.

Identifying a specific Pfam model at Sanger Center:

You are trying to find the Pfam accession, which is something like PF05119.  Most typically you would go to the protein search tab, and find the matching domain(s) by pasting a representative sequence into the search box.  A keyword search is also available, but the number of keywords associated with each domain is not very large.  Pfam uses different accession numbers than NCBI, so you will often find it simpler to find the families by the protein sequence search than by searching for an accession number.  The RPS search page at NCBI also searches Pfam, but with a slightly less powerful search algorithm.  You get an automatic RPS search accompanying each Psi-Blast search if you check the CD-search box.

Identifying a specific Pfam model in the local Pfam distribution:

The command to search a given key sequence against the local Pfam library is:
hmmpfam --acc $HMMERDB/Pfam_ls key.fa > results.out
or
hmmpfam --acc $HMMERDB/Pfam_fs key.fa > results.out

where the --acc option causes the Pfam accession number to be given rather than a common name for the family, key.fa is a fasta file with the protein sequence to be searched, and results.out is an arbitrary output file name.  The key could be a multi-fasta file with multiple sequences, which will all be searched.  A variety of other formats for the input sequence are auto-detected.  For that and other options, see the userguide.
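
Since the two libraries can give different answers for the same key, it can be convenient to run both searches in one step.  The following is a minimal sketch of that idea, not a HMMER feature: the function name search_both_pfam and the output file naming are illustrative assumptions, and only the hmmpfam call itself comes from the commands shown here.

```shell
# Sketch: search one key file against both local Pfam libraries.
# search_both_pfam is a hypothetical helper name, not part of HMMER.
search_both_pfam() {
    key=$1                 # FASTA file holding the key sequence(s)
    base=${key%.*}         # drop the extension, e.g. key.fa -> key
    for lib in Pfam_ls Pfam_fs; do
        # --acc reports Pfam accession numbers instead of family names
        hmmpfam --acc "$HMMERDB/$lib" "$key" > "$base.$lib.out" || return 1
    done
}
```

Calling search_both_pfam key.fa would write key.Pfam_ls.out and key.Pfam_fs.out in the current directory, one results file per library.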

Note:  The precomputed alignments corresponding to the models are not maintained locally.  One can retrieve them by entering the Pfam accession number at the Sanger site, or at one of the mirror sites.

Extracting a specific model from the local Pfam distribution:

The syntax below copies model PF05119 out of the Pfam_fs library into a separate file.  The library itself is not changed.
hmmfetch $HMMERDB/Pfam_fs PF05119 >PF05119fs.mod
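
If you expect to compare the two search modes, you may want the same family from both libraries.  The sketch below repeats the hmmfetch call for each library and follows the file naming used here (PF05119fs.mod); the function name fetch_both_modes and the matching PF05119ls.mod name are assumptions made for illustration.

```shell
# Sketch: extract one Pfam family from both local libraries.
# fetch_both_modes is a hypothetical helper name, not part of HMMER.
fetch_both_modes() {
    acc=$1                 # Pfam accession, e.g. PF05119
    for mode in fs ls; do
        # writes PF05119fs.mod and PF05119ls.mod for acc=PF05119
        hmmfetch "$HMMERDB/Pfam_$mode" "$acc" > "${acc}${mode}.mod" || return 1
    done
}
```

For example, fetch_both_modes PF05119 would leave both model files in the current directory.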

Searching the nonredundant protein database for matches to a model.

The following syntax searches the local nr database for matches to the model extracted above:
hmmsearch [options] PF05119fs.mod $BLASTDB/nr >results.out
Refer to the userguide for options to adjust the stringency of the search.
This search can take several hours.
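
Because the search runs for hours, it is usually started in the background so that it survives closing the terminal.  The sketch below is one way to do that, under stated assumptions: start_nr_search is a hypothetical helper name, the E value cutoff is passed with the hmmsearch -E option as one example of a stringency setting, and error messages are sent to a separate file named after the output file.

```shell
# Sketch: start a long hmmsearch against nr in the background.
# start_nr_search is a hypothetical helper name, not part of HMMER.
start_nr_search() {
    model=$1               # e.g. PF05119fs.mod
    evalue=$2              # E value cutoff handed to -E, e.g. 0.001
    out=$3                 # e.g. results.out
    # Ignore the hangup signal inside the subshell so the search is not
    # killed when the terminal closes; errors go to a separate file.
    ( trap '' HUP; hmmsearch -E "$evalue" "$model" "$BLASTDB/nr" ) \
        > "$out" 2> "$out.err" &
    echo "hmmsearch started in background, PID $!"
}
```

A call such as start_nr_search PF05119fs.mod 0.001 results.out returns at once; the results file is complete only after the background job has finished.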

Searching a specialized collection of sequences for matches to a model.

These, for example, could be sequences you had determined yourself and had not yet submitted to GenBank.

hmmsearch -Z n [options] PF05119fs.mod seqs.fa  >results.out
where seqs.fa is any multifasta-formatted file of protein sequences.
The -Z n option sets the number of sequences assumed in the database for E value calculation to the number in nr.  This causes the E values reported for the arbitrarily small seqs.fa to be the same as if those sequences had been found in nr.  To find the value of n, run fastacmd -d $BLASTDB/nr -I
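
As an alternative to reading n from the fastacmd report, the count can be taken directly from the fasta file and handed to -Z.  This is a sketch only: it assumes nr is available as a plain fasta file (as it must be for hmmsearch to read it), it counts header lines rather than asking the Blast database, and both function names are hypothetical.

```shell
# Sketch: count the records in a fasta file and pass the count to -Z.
# Both function names are hypothetical helpers, not part of HMMER.
count_fasta_records() {
    # every fasta record starts with a '>' header line
    grep -c '^>' "$1"
}

search_with_nr_size() {
    model=$1               # e.g. PF05119fs.mod
    seqs=$2                # your own multifasta file, e.g. seqs.fa
    nr_fasta=$3            # fasta file whose size should set the E values
    n=$(count_fasta_records "$nr_fasta")
    hmmsearch -Z "$n" "$model" "$seqs"
}
```

Counting a file the size of nr takes a while, so if you run many searches it is worth noting the value of n once and reusing it.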

If you want to merge your own sequences with nr so that the hits will be reported in context, just copy nr to your own directory and concatenate your sequences to it.  Remember to erase both files when you're done, so as not to tie up disk space with obsolete copies of the protein library.

cp $BLASTDB/nr .
cat nr seqs.fa > newlib
hmmsearch [options] PF05119fs.mod newlib >results.out
rm nr
rm newlib
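
The same steps can be wrapped so that the cleanup is never forgotten.  The sketch below is a variation, not the procedure given here verbatim: it concatenates straight from the shared nr file instead of making a private copy first, and it removes the merged file whether or not the search succeeds.  The function name search_merged and its argument order are assumptions.

```shell
# Sketch: search your sequences merged with nr, then remove the merged file.
# search_merged is a hypothetical helper name, not part of HMMER.
search_merged() {
    model=$1               # e.g. PF05119fs.mod
    seqs=$2                # your own multifasta file
    nr_fasta=$3            # path to the nr fasta file
    out=$4                 # e.g. results.out
    merged=$(mktemp ./newlib.XXXXXX) || return 1
    # concatenate directly from nr, so no second full copy of nr is made
    cat "$nr_fasta" "$seqs" > "$merged" &&
        hmmsearch "$model" "$merged" > "$out"
    status=$?
    rm -f "$merged"        # clean up whether or not the search succeeded
    return "$status"
}
```

The merged file still needs as much disk space as nr while the search runs, so check that your directory has room before starting.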

Using an HMM model to create an alignment.
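
One way to do this is with the hmmalign program from the HMMER package, which aligns a set of sequences to a model.  The sketch below is a minimal example under assumptions: align_to_model is a hypothetical helper name, the file names are placeholders, and the userguide should be consulted for the alignment options.

```shell
# Sketch: align a set of sequences to an extracted model with hmmalign.
# align_to_model is a hypothetical helper name, not part of HMMER.
align_to_model() {
    model=$1               # e.g. PF05119fs.mod
    seqs=$2                # fasta file of family members to align
    out=$3                 # file that receives the alignment
    # -o names the output file; the model comes before the sequence file
    hmmalign -o "$out" "$model" "$seqs"
}
```

For example, align_to_model PF05119fs.mod seqs.fa PF05119.aln would write the alignment of the sequences in seqs.fa to PF05119.aln.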


Making a new HMMER model.
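
A new model is typically built from a trusted alignment with hmmbuild and then calibrated with hmmcalibrate so that E values can be reported.  The sketch below shows that two-step sequence under assumptions: build_and_calibrate is a hypothetical helper name, the file names are placeholders, and hmmbuild is run with its default settings.

```shell
# Sketch: build a model from an alignment, then calibrate it.
# build_and_calibrate is a hypothetical helper name, not part of HMMER.
build_and_calibrate() {
    aln=$1                 # trusted alignment, e.g. saved from Clustal
    hmm=$2                 # name for the new model file
    # hmmbuild takes the output model file first, then the alignment
    hmmbuild "$hmm" "$aln" || return 1
    # hmmcalibrate fits E value statistics and re-saves the model file
    hmmcalibrate "$hmm"
}
```

For example, build_and_calibrate myfamily.aln myfamily.mod would produce a calibrated myfamily.mod that can be used with hmmsearch in the same way as a model extracted from Pfam.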

Further programs in the package, and their options, are described in the userguide.



Last updated 5/26/2006 - Steve Hardies