SAM is a system that applies Hidden Markov Matrix methods to the problem
of analyzing protein familes.
Typical uses are:
SAM
documentation.
Test data files mentioned in the documentation are found in /usr/local/sam3.3.1/sam-demos
One can read some of the scripts for further documentation in /usr/local/sam3.3.1/bin
Some of the programs give usage information if started with an invalid
command line, or the -help parameter.
For running local target99 jobs, particularly in a learning mode,
it is essential to capture the standard error output to a log file and
read it. See unix
tips.
Because of its more sophisticated search strategy, target99 may find distant homologues that Psi-Blast will miss. However, target99 can also miss distant homologues that Psi-Blast will find for the following reason. On each iteration where SAM uses a prior alignment to search for more distant homologues in the nr database, it confines its search to a subset of nr that matches the initial sequence by BLAST to within some loose limits (typically E < 300). This is to reduce computing time. So in the default target99 strategy, divergent homologues outside of that initial set will never be considered even though the model constructed by SAM might recognize them. One can override this problem by use of the -homologs option to specify a multifasta file containing an alternative set of sequences to search instead of the set filtered out of nr by blast. For example, one could take the alignment output by target99 in default mode and use it to seed a Psi-Blast search. One could then download the set of Psi-Blast finds at some large E value and use it with the -homologs option in a subsequent target99 search. Alternativley, one can use the alignment from a target99 job to seed another target99 job. Since the prefilter will admit sequences within E < 300 to any of the members of the alignment, more divergent homologues may now be admitted. However, the target99 search slows down dramatically with a large alignment seed, and a limit of 10,000 sequences passing the prefilter is imposed for practicality's sake.
Target99 actually uses two progressively larger sets to build its alignment. On the first pass it uses a set of sequences close to the seed sequence (typicall E< 5.e-4 adusted by parameter -blast_close; see doc section 4.8 for adjustable parameters). On subsequent passes it uses the broader set mentioned above (adjusted by parameter -blast_e). The nature of the search key and the data to be searched can be altered as follows. In all cases the result would be an alignment in an output file named foo.a2m.
| Variation | Example |
| Basic target99 search to find homologues of a single seed sequence,
and create an alignment.
If after the -seed parameter a fasta filename containing a single sequence is given, then close and expanded sequence sets derived from nr by blast will be used as described above. A subsequent tuneup job (see below) is advisable to improve the alignment. NOTE: lower case characters in seed sequences or alignments are ignored. |
target99 -seed single.seq -db $BLAST_DB/nr -out foo 2> foo.log |
| Basic target99 search to expand a family represented by a "trusted" alignment. The trusted alignment could be from a preceeding target99 job followed by a tuneup, or from pfam, or from a collection of homologues derived by some other method and aligned, say by culstalW. If after the -seed parameter an alignment file is given (in multifasta, or a2m format), then that alignment will be used without revision to key the search which is otherwise as above. [All sequences are used to do the Blast prefilter search and the union is taken. If the seed alignment is very large, the blast searching and the size of the prefiltered files may become prohibitive.] | target99 -seed align.a2m -db $BLAST_DB/nr -out foo 2>foo.log |
| If -seed <alignment> and -force_seed 0 is given, then it is as above only the sequences from the seed are realigned at each iteration. In most cases, one would run the basic target99 job separately and then tuneup instead of trying to compress both processes into one job like this. | target99 -seed align.a2m -force_seed 0 -db $BLAST_DB/nr -out foo 2>foo.log |
| Tuneup: If -seed <alignment> is given and -tuneup, then the seed alignment is realigned, but no other searching is done. The seed alignment file has to be aligned within reason to start with. To align a set of sequences that are completely unaligned, use the -homologs option with -seed (below) | target99 -seed align.a2m -tuneup -out foo 2>foo.log |
| To extend alignment into regions that had orginally been ignored and demoted to lower case. Do a tuneup job as indicated above. The tuneup may also demote additional sequences to lower case. | |
| Searching a defined set of sequences. This is a way to force candidates for distant homologues past the prefilter. The set could be all sequences found by multiple rounds of Psi-blast, or all sequences returned by Entrez by a particular definition (like "kinase"), or all sequences of a particular taxonomic kind (like "viruses"[orgn]), or sequences found by Blast at E<300 for a subset of the already aligned sequences suspected of belonging to a poorly represented distant related family. Only the set specified by -homologs is searched. nr is not searched. | target99 -seed align.a2m -homologs seqs.fa -out foo 2>foo.log |
| Using predefined close and far homologues.
-close <fasta file>, used together with -homologs <fasta file> causes the set of sequence named with -close to be searched in the first iteration, and the set of sequences named with -homologs to be searched for subsequent iterations. No nr database search is done. This is essentially like the basic target99 search except the user controls which sequences are considered on the first round, and which are considered thereafter. |
target99 -seed single.seq -close some_seqs.fa -homologs more_seqs.fa -out foo 2>foo.log |
| Aligning unaligned sequences.
If -seed is omitted, but -close or -homologs is given, then SAM attempts to align the unaligned sequences. ClustalW or T-Coffee may be a little better at this job. All of the methods have trouble if there are long unalignable sections interspersed with conserved motifs. For difficult alignments, try FIMs to mask unalignable regions (below). |
target99 -homologs some_related_sequences.fa -out foo 2>foo.log |
| Masking regions. Conserved domains separated by poorly conserved regions are most easily discovered in a given seed sequence by doing many rounds of PsiBlast at the NCBI server and looking at the graphical result display. Most often, one is advised to develop the family for each domain separately. However, if one wants to develop the family while keeping the domains together, then in the seed sequence edit a single "O" in place of each unconserved segment to be masked. This is called a FIM. See doc section 8.5. Equivalently, you could demote the region to be masked to lower case. FIMs are typically placed on both ends of a domain seed, causing there to be no end gap penalty. If you want to adde FIMs to a pre-existing alignment, add them to a seed sequence and redevelop the alignment with -homologs on the .a2m file carrying the previous alignment. | |
| Using target99 to build the trusted alignment. If starting from a pfam family, or other family supplied without the NCBI master gi numbers, the following trick will rebuild and tune up the alignment out of NCBI sequences. This has the advantage of avoiding accumulation of unrecognized duplicate sequences within the alignment. First, convert the pfam alignment to a multifasta file. Use that as a seed to a target99 job with a very stringent E value cutoff for the Blast prefilter. The resulting alignment will have just the original seed sequences and their equivalents from nr. Then use the uniqueseq utility to subtract the original sequences. | target99 -seed pfam.fa -db $BLAST_DB/nr -out foo 2>bar.log
uniqueseq <newalign_name> -alignfile foo.a2m -train pfam.fa |
| To search a library subset too big for -homologs:
Instead of searching nr, you may substitute a customized database formatted for blast searching. The custom database would be first represented by a fasta file. This might be downloaded from Entrez or Batch-Entrez at NCBI, or retrieved from the local nr library based on a file containing the list of gi number. This should be formatted by formatdb, and the fasta file and the formatted files should remain in the same directory. The Blast prefilter would then be more sensitive to divergent homologues, and it might be possible to open it up further because of the reduction in noise. If the database is not so large as to require the Blast prefilter, then give the fasta file with the -homologs option and formatting for Blast will not be necessary. |
fastacmd -d $BLAST_DB/nr -i <filename with gi list> -tT -o custom_lib
formatdb -i custom_lib -oT -pT target99 -seed {single.seq or align.a2m} -db custom_lib -blast_e 1000 -out foo 2>foo.log |
| Task | Do |
| Convert an alignment (input.a2m file) to a "human readable" (meaning
any program other than SAM) form. The .a2m file created by SAM
demotes some some segments considered as insertions to lower case.
For other programs to read the file in alignment, either the lower case
characters must be removed, or gaps must be added to the other files to
balance them. The program prettyalign adds periods to balance the
lower case characters. The output file is interleaved. If the
header of the output file is replaced with the line CLUSTAL W, followed
by a blank line, then it can be read into most other programs as a CLUSTAL
formatted (or .aln) file. Significant options:
-f -- creates a fasta file for output
-m4 -- makes inserts to 4 characters, eg a 80 character insert starting with a and ending with v would be inserted as a80v and balanced by .... in the other sequences. |
prettyalign input.a2m >output.pre
See doc. section 3.2, 10.1.1 |
| Convert the final alignment back to a HMM for use in searching databases.
This task is conducted iteratively in the target99 script, but the final
model file is not retained. The model file, foo.mod, is input
for a variety of SAM operations, and also contains within it an accounting
of the statistics involved in its creation. These scripts adjust
the weights assigned for inertions, deletions, and replacements consistent
with the data in the alignment and rules provided by a "regularizer".
They do not attempt to improve the alignment to be consistent with the
rules in the regularizer.
The model building scripts available in sam3.4 are w0.5, w0.7,and w1.0. Generally use w0.5 for divergent comparisons, and w1.0 for making family logos, and w0.7 for purposes in between. The legends in the SAM documentation have it reversed from its correct sense. |
Choose a model building script suitable for your specific purpose from
those listed in doc. section 4.9, eg:
fw0.5 <input alignment> <output name>
Or use all default parameters: modelfromalign <output name> -alignfile <input alignment> |
| Build ("train") a model. Iterates between extracting weights from an alignment as above, and readjusting the gap positions in the alignment consistent with those weights. This program has many, many parameters, and is best left under the control of the target99 or target2k script. | buildmodel <output name> -train <input.a2m>
See doc. section 3.1, section 9 |
| Convert a model file to ASCII. The most common reason
would be to read the header and discover what regularizer had been used
to make the mode. Note: we renamed this script to avoid confusion
with a script of the same name supplied with HMMER. It actually converts
the normal binary model format to ASCII, or an ASCII formated file to binary.
Note: we changed its name to avoid confusion with a script by the sam name in the hmmer package. |
hmmconvert-sam <new model name> -model_file foo.mod |
| Use an HMM model to search out the best matching sequences in a
database. The database is provided as a multi-fasta file.
The program is two slow to search through the full nr. target99 uses
this program several times, but uses a blast pre-filter to limit the size
of the database searched to <10,000 sequences. This would typically
be used by a user to search the sequences found by target99 to trim off
ones from the fringes. It could alternatively be used to measure
the closeness of sequences from one family to the model representing a
related family, or to score a domain model within a larger alignment. In
reverse search mode, it is used to search a small number of sequences against
a library of models. Significant options are:
-sw 0 (whole model to whole sequence; the default) -sw 1 (submodel to sequence matching) -sw 2 (submodel to subsequence) - usual -sw 3 (model to subsequence) - for domain searching -calibrate 1 (Causes proper calibration of the E values) -modellibrary <file1,file2...> (instead of -i, searches agains an array of models concatenated into one or more files). A series of parameters can cause hmmscore to place sequences matching to within certain tolerances into an output file. See section 10.2.2 |
To calibrate a model, search a small library representative of negative
matches. My library for this purpose is /home/user/hardies/database/smhuman
Use same -sw intended for scoring prospective positives. hmmscore <output name> -calibrate 1 -i <input.mod> -db /home/user/hardies/database/smhuman -sw 2 The calibrated model is named <output name>.mlib. Now use the .mlib file to search the sequence or sequences to be scored. This could be an individual sequence, or a library of sequences of interest, perhaps enriched for positives by blast at low stringency, or another family suspected to be related. hmmscore <output name> -modellibrary <input.mlib> -db <input seqs.fa> -sw 2 Hits are saved in <output name>.dist Or to recover the alignment of the hits of E value < 0.1: hmmscore <output name> -modellibrary <input.mlib> -db <inputseqs.fa> -sw 2 -select_align 4 -Emax 0.1 This puts alignment in <output name>.a2m
|
| To align sequences to a model. This is typically used to make a revised alignment, given the model just made from a seed alignment. Unlike hmmscore, sequences will not be rejected due to poor matching. Can use -sw parameters. | align2model <output name> -i <input.mod> -db <input seqs.fa>
See doc. section 3.2, 10.1 |
| Calculate E value that two sequence families match each other. | Make alignment for each with with target99, and convert each to a HMM model. Score each set with the other model with hmmscore, average the "reverse" scores and convert to E value. See doc. section 4.2 |
| To limit definition of a sequence family to the level of pfam (ie., exclude coalescence into superfamilies or convergent evolution due to common fold) | Use target99 with -family option. See doc. section 4.3 |
| Predict secondary structure of a sequence aided by the existence of homologs. |
|
| Convet SAM HMM to HMMER HMM & vice versa | sam2hmmer <output name> -i sam.mod
hmmer2sam <inputfilename> <outputfilename> See doc. section 10.8.4 |
| To remove duplicate sequences from an alignment. Typically
the seed sequences are also found again by target99, and so appear twice
in the alignment. Significant options: -train seed.a2m (specifically
removes all seeds seqs).
-db seqs.fa (removes duplicates from an unaligned fasta file). |
uniqueseq <output name> -alignfile foo.a2m |