The local HMMER installation allows additional exploration of protein families. Profile HMMs can be extracted from Pfam and used to align family members or to search for more divergent family members. HMMs for novel protein families can be created, and existing families can be expanded or subdivided. For expanding to superfamilies, consider the Sequence Alignment and Modeling program suite (SAM). SAM is much like HMMER, but has additional features that are particularly helpful for characterizing superfamilies. Alternatively, one could build an HMMER model starting from an alignment made by any other trusted method. Most typically, new HMMER models are made after a BLAST search followed by a Clustal alignment.
Pfam uses expert annotators to divide protein sequences into domains and to decide how divergent sequences should be before they are considered not to belong to the family. There are two Pfam libraries that can be searched, each containing a model for each domain represented in the database. The path and name for the local versions of the libraries are:
Identifying a specific Pfam model at Sanger Center:
You are trying to find the Pfam accession, which looks something like PF05119. Most typically you would go to the protein search tab and find the matching domain(s) by pasting a representative sequence into the search box. A keyword search is also available, but the number of keywords associated with each domain is not very large. Pfam uses different accession numbers than NCBI, so you will often find it simpler to find a family by protein sequence search than by accession number. The RPS search page at NCBI also searches Pfam, but with a slightly less powerful search algorithm. An RPS search runs automatically with each PSI-BLAST search if you check the CD-search box.
Identifying a specific Pfam model in the local Pfam distribution:
The command to search a given key sequence against the local Pfam library
is:
hmmpfam --acc $HMMERDB/Pfam_ls key.fa > results.out
or
hmmpfam --acc $HMMERDB/Pfam_fs key.fa > results.out
where the --acc option causes the Pfam accession number to be
reported rather than a common name for the family, key.fa
is a FASTA file with the protein sequence to be searched, and
results.out is an arbitrary output file name. The key file
can be a multi-FASTA file, in which case every sequence is searched.
A variety of other input sequence formats are auto-detected.
For that and other options, see the user guide.
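Since the two commands above differ only in the library name, they can be wrapped in a small loop. The following is a minimal sketch, not part of HMMER: the function name search_both_libraries and the output naming scheme are made up for illustration, and it assumes hmmpfam is on the PATH and HMMERDB is set as in the commands above.

```shell
#!/bin/sh
# Sketch only: assumes HMMER 2 (hmmpfam) is installed and HMMERDB is set.

# search_both_libraries is a hypothetical helper, not an HMMER command.
# Usage: search_both_libraries key.fa results
# Writes results.Pfam_ls.out and results.Pfam_fs.out.
search_both_libraries() {
    key=$1
    prefix=$2
    for lib in Pfam_ls Pfam_fs; do
        # Same command as shown above, run once per library.
        hmmpfam --acc "$HMMERDB/$lib" "$key" > "$prefix.$lib.out" || return 1
    done
}
```

Calling search_both_libraries key.fa results would leave one output file per library, so the hits from the two libraries can be compared side by side.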
Note: The precomputed alignments corresponding to the models are not maintained locally. One could retrieve them by entering the Pfam accession number at the Sanger site, or one of the mirror sites.
Extracting a specific model from the local Pfam distribution:
The syntax below extracts a copy of model PF05119 from the Pfam_fs library
into a separate file. The library itself is not modified.
hmmfetch $HMMERDB/Pfam_fs PF05119 >PF05119fs.mod
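When several models are needed, the same hmmfetch call can be repeated per accession. This is a sketch under assumptions: fetch_models is a hypothetical helper name, and the output names simply follow the PF05119fs.mod pattern used above (accession, then the library suffix).

```shell
#!/bin/sh
# Sketch only: assumes hmmfetch (HMMER 2) is installed and HMMERDB is set.

# fetch_models is a hypothetical helper, not an HMMER command.
# Usage: fetch_models Pfam_fs PF05119 [more accessions...]
fetch_models() {
    lib=$1
    shift
    # Pfam_fs -> fs, Pfam_ls -> ls, to mimic the PF05119fs.mod naming.
    suffix=${lib#Pfam_}
    for acc in "$@"; do
        hmmfetch "$HMMERDB/$lib" "$acc" > "${acc}${suffix}.mod" || return 1
    done
}
```

For example, fetch_models Pfam_fs PF05119 writes PF05119fs.mod in the current directory, exactly as the single command above does.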
Searching the nonredundant protein database for matches to a model.
The following syntax searches the local nr database for matches to the
model extracted above:
hmmsearch [options] PF05119fs.mod $BLASTDB/nr >results.out
Refer to the user guide
for options to adjust the stringency of the search.
This search can take several hours.
Searching a specialized collection of sequences for matches to a model.
These, for example, could be sequences you had determined yourself and had not yet submitted to GenBank.
hmmsearch -Z n [options] PF05119fs.mod seqs.fa >results.out
where seqs.fa is any multi-FASTA file of protein sequences.
The -Z n option sets the number of sequences used for the E value
calculation to the number in nr. This causes the E values
reported for the arbitrarily small seqs.fa to be the same as if those sequences
had been found in nr. To find the value of n, run:
fastacmd -d $BLASTDB/nr -I
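The effect of -Z can be checked by hand, because HMMER computes an E value as the P value of a hit multiplied by the number of sequences searched, so E values scale linearly with that number. The sketch below uses two hypothetical helpers (count_fasta_records and rescale_evalue are illustrative names, not HMMER or BLAST commands): one counts the records in a plain FASTA file, the other converts an E value reported for one database size to the value it would have at another.

```shell
#!/bin/sh
# Sketch only: both helpers are illustrative, not part of HMMER or BLAST.

# Count sequences in a plain FASTA file by counting '>' header lines.
# Usage: count_fasta_records seqs.fa
count_fasta_records() {
    grep -c '^>' "$1"
}

# Rescale an E value from a search over z_old sequences to z_new sequences.
# Relies on E = P * Z, so E_new = E_old * z_new / z_old.
# Usage: rescale_evalue E_old z_old z_new
rescale_evalue() {
    awk -v e="$1" -v z_old="$2" -v z_new="$3" \
        'BEGIN { printf "%.3g\n", e * z_new / z_old }'
}
```

For instance, a hit with an E value of 0.5 in a search over 10 sequences would be reported with an E value of 50 if the same 10 sequences were searched with -Z 1000, which is why hits in a small private collection look less significant once they are scored as if they came from nr.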
If you want to merge your own sequences with nr so that the hits will be reported in context, just copy nr to your own directory and concatenate your sequences to it. Remember to erase it when you're done, so as not to tie up disk space with obsolete copies of the protein library.
cp $BLASTDB/nr .
cat nr seqs.fa > newlib
hmmsearch [options] PF05119fs.mod newlib >results.out
rm nr
rm newlib
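A variant of these steps skips the intermediate copy of nr and removes the merged library even when the search fails. This is only a sketch: search_merged is a hypothetical helper name, search options are omitted for brevity, it assumes the nr file lives under $BLASTDB as in the nr search above, and it relies on the common (but not POSIX-mandated) mktemp utility.

```shell
#!/bin/sh
# Sketch only: assumes hmmsearch (HMMER 2) is installed and that
# $BLASTDB/nr is a FASTA file, as in the steps above.

# search_merged is a hypothetical helper, not an HMMER command.
# Usage: search_merged PF05119fs.mod seqs.fa results.out
search_merged() {
    model=$1
    extra=$2
    out=$3
    # Temporary merged library in the current directory.
    merged=$(mktemp ./newlib.XXXXXX) || return 1
    # Concatenate nr and the private sequences, then search the result.
    cat "$BLASTDB/nr" "$extra" > "$merged" &&
        hmmsearch "$model" "$merged" > "$out"
    status=$?
    # Always delete the merged copy so it does not tie up disk space.
    rm -f "$merged"
    return "$status"
}
```

Because the merged file is written once and deleted by the function itself, there is no second copy of nr to remember to erase afterwards.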
Using an HMM model to create an alignment.
Making a new HMMER model.
HMMER expects a Clustal alignment file to begin with the header line
CLUSTAL W (1.7) multiple sequence alignment
and tolerates no number lines or other liberties with the format.