SAM

SAM is a system that applies Hidden Markov Matrix methods to the problem of analyzing protein familes.
Typical uses are:

SAM is the product of the UCSC Computational Biology Group and can be accessed at their SAM web site.
The new T08 server is at T08 SAM server.
Preliminary CASP 8 results.
See Casp 7 results.
For UTHSCSA users, we have implemented a local version at the UTHSCSA Bioinformatics Core Facility. (see program status).  The local implementation allows more flexibility in usage than is allowed through the UCSC server, whereas the UCSC server is set up for greater convenience for conducting certain standard jobs.


SAM documentation.
Test data files mentioned in the documentation are found in /usr/local/sam3.3.1/sam-demos
One can read some of the scripts for further documentation in /usr/local/sam3.3.1/bin
Some of the programs give usage information if started with an invalid command line, or the -help parameter.

For running local target99 jobs, particularly in a learning mode,  it is essential to capture the standard error output to a log file and read it.  See unix tips.
 

The target99 script.

The script target99 automates a series of operations by the program SAM to accomplish the common task of starting with a single sequence or trusted alignment and searching the nr database to progressively build an alignment of homologues.  By comparison to Psi-Blast, both Psi-Blast and SAM attempt to learn from closer homologues which amino acid residues should be considered acceptable at each position in more divergent homologues.  In contrast to Psi-Blast, SAM also attempts to learn from closer homologues and from a consideration of secondary structure where alignment gaps should be more acceptable in  more divergent homologues.  SAM attempts to optimize the multiple alignment it reports, whereas the multiple alignment reported by Psi-Blast is extremely crude.  Both SAM and Psi-Blast can find homologues too divergent for ClustalW to align, so the alignment produced by SAM is the one of the best one can get by an automated method on very divergent homologues.  Note however, that ClustalW or a related program T-Coffee may do better at aligning distant families, each of which is well populated.

Because of its more sophisticated search strategy, target99 may find distant homologues that Psi-Blast will miss.  However, target99 can also miss distant homologues that Psi-Blast will find for the following reason.  On each iteration where SAM uses a prior alignment to search for more distant homologues in the nr database, it confines its search to a subset of nr that matches the initial sequence by BLAST to within some loose limits (typically E < 300).  This is to reduce computing time.  So in the default target99 strategy, divergent homologues outside of that initial set will never be considered even though the model constructed by SAM might recognize them.  One can override this problem by use of the -homologs option to specify a multifasta file containing an alternative set of sequences to search instead of the set filtered out of nr by blast.  For example, one could take the alignment output by target99 in default mode and use it to seed a Psi-Blast search.  One could then download the set of Psi-Blast finds at some large E value and use it with the -homologs option in a subsequent target99 search.  Alternativley, one can use the alignment from a target99 job to seed another target99 job. Since the prefilter will admit sequences within E < 300 to any of the members of the alignment, more divergent homologues may now be admitted.  However, the target99 search slows down dramatically with a large alignment seed, and a limit of 10,000 sequences passing the prefilter is imposed for practicality's sake.

Target99 actually uses two progressively larger sets to build its alignment.  On the first pass it uses a set of sequences close to the seed sequence (typicall E< 5.e-4 adusted by parameter -blast_close; see doc section 4.8 for adjustable parameters).  On subsequent passes it uses the broader set mentioned above (adjusted by parameter -blast_e).  The nature of the search key and the data to be searched can be altered as follows.  In all cases the result would be an alignment in an output file named foo.a2m.

target99 uses

We haven't identified specific differences in the target2k interface as of yet.  It has some differences in internal operation.
Variation Example
Basic target99 search to find homologues of a single seed sequence, and create an alignment.
If after the -seed parameter a fasta filename containing a single sequence is given, then close and expanded sequence sets derived from nr by blast will be used as described above.  A subsequent tuneup job (see below) is advisable to improve the alignment. NOTE: lower case characters in seed sequences or alignments are ignored.
target99 -seed single.seq -db $BLAST_DB/nr -out foo 2> foo.log
Basic target99 search to expand a family represented by a "trusted" alignment.  The trusted alignment could be from a preceeding target99 job followed by a tuneup, or from pfam, or from a collection of homologues derived by some other method and aligned, say by culstalW.  If after the -seed parameter an alignment file is given (in multifasta, or a2m format), then that alignment will be used without revision to key the search which is otherwise as above.  [All sequences are used to do the Blast prefilter search and the union is taken. If the seed alignment is very large, the blast searching and the size of the prefiltered files may become prohibitive.] target99 -seed align.a2m -db $BLAST_DB/nr -out foo 2>foo.log
If -seed <alignment> and -force_seed 0 is given, then it is as above only the sequences from the seed are realigned at each iteration.  In most cases, one would run the basic target99 job separately and then tuneup instead of trying to compress both processes into one job like this. target99 -seed align.a2m -force_seed 0 -db $BLAST_DB/nr -out foo 2>foo.log
Tuneup: If -seed <alignment> is given and -tuneup, then the seed alignment is realigned, but no other searching is done.  The seed alignment file has to be aligned within reason to start with. To align a set of sequences that are completely unaligned, use the -homologs option with  -seed (below) target99 -seed align.a2m -tuneup -out foo 2>foo.log
To extend alignment into regions that had orginally been ignored and demoted to lower case.  Do a tuneup job as indicated above.  The tuneup may also demote additional sequences to lower case.
Searching a defined set of sequences.  This is a way to force candidates for distant homologues past the prefilter.  The set could be all sequences found by multiple rounds of Psi-blast, or all sequences returned by Entrez by a particular definition (like "kinase"), or all sequences of a particular taxonomic kind (like "viruses"[orgn]), or sequences found by Blast at E<300 for a subset of the already aligned sequences suspected of belonging to a poorly represented distant related family.  Only the set specified by -homologs is searched. nr is not searched. target99 -seed  align.a2m  -homologs seqs.fa -out foo 2>foo.log
Using predefined close and far homologues.
-close <fasta file>, used together with -homologs <fasta file> causes the set of sequence named with -close to be searched in the first iteration, and the set of sequences named with -homologs to be searched for subsequent iterations.  No nr database search is done.  This is essentially like the basic target99 search except the user controls which sequences are considered on the first round, and which are considered thereafter.
target99 -seed  single.seq -close some_seqs.fa -homologs more_seqs.fa -out foo 2>foo.log
Aligning unaligned sequences.
If -seed is omitted, but -close or -homologs is given, then SAM attempts to align the unaligned sequences.  ClustalW or T-Coffee may be a little better at this job.  All of the methods have trouble if there are long unalignable sections interspersed with conserved motifs. For difficult alignments, try FIMs to mask unalignable regions (below).
target99 -homologs some_related_sequences.fa -out foo 2>foo.log
Masking regions.  Conserved domains separated by poorly conserved regions are most easily discovered in a given seed sequence by doing many rounds of PsiBlast at the NCBI server and looking at the graphical result display.  Most often, one is advised to develop the family for each domain separately.  However, if one wants to develop the family while keeping the domains together, then in the seed sequence edit a single "O" in place of each unconserved segment to be masked.  This is called a FIM.  See doc section 8.5.  Equivalently, you could demote the region to be masked to lower case.  FIMs are typically placed on both ends of a domain seed, causing there to be no end gap penalty.  If you want to adde FIMs to a pre-existing alignment, add them to a seed sequence and redevelop the alignment with -homologs on the .a2m file carrying the previous alignment.
Using target99 to build the trusted alignment. If starting from a pfam family, or other family supplied without the NCBI master gi numbers, the following trick will rebuild and tune up the alignment out of NCBI sequences.  This has the advantage of avoiding accumulation of unrecognized duplicate sequences within the alignment.  First, convert the pfam alignment to a multifasta file.  Use that as a seed to a target99 job with a very stringent E value cutoff for the Blast prefilter.  The resulting alignment will have just the original seed sequences and their equivalents from nr.  Then use the uniqueseq utility to subtract the original sequences. target99 -seed pfam.fa -db $BLAST_DB/nr -out foo 2>bar.log

uniqueseq <newalign_name> -alignfile foo.a2m -train pfam.fa

To search a library subset too big for -homologs:
Instead of searching nr, you may substitute a customized database formatted for blast searching.  The custom database would be first represented by a fasta file. This might be downloaded from Entrez or Batch-Entrez at NCBI, or retrieved from the local nr library based on a file containing the list of gi number.  This should be formatted by formatdb, and the fasta file and the formatted files should remain in the same directory.  The Blast prefilter would then be more sensitive to divergent homologues, and it might be possible to open it up further because of the reduction in noise.  If the database is not so large as to require the Blast prefilter, then give the fasta file with the -homologs option and formatting for Blast will not be necessary.
fastacmd -d $BLAST_DB/nr -i <filename with gi list> -tT -o custom_lib

formatdb -i custom_lib -oT -pT

target99 -seed {single.seq or align.a2m} -db custom_lib -blast_e 1000 -out foo 2>foo.log

Notes concerning the local implementation.

The following individual programs are either used as part of the target99 and may be used individually, or are utilities that may be used in conjunction with target99 jobs.

 
Task Do
Convert an alignment (input.a2m file) to a "human readable" (meaning any program other than SAM) form.  The .a2m file created by SAM demotes some some segments considered as insertions to lower case.  For other programs to read the file in alignment, either the lower case characters must be removed, or gaps must be added to the other files to balance them.  The program prettyalign adds periods to balance the lower case characters.  The output file is interleaved.  If the header of the output file is replaced with the line CLUSTAL W, followed by a blank line, then it can be read into most other programs as a CLUSTAL formatted (or .aln) file.  Significant options:
-f  -- creates a fasta file for output
  • It adds a header listing the deflines that may be useful in its own right, buy may need stripped off for input to other programs.
  • It adds a superfluous . to the end of each sequence
-m0  -- removes lower case characters
-m4 -- makes inserts to 4 characters, eg a 80 character insert starting with a and ending with v would be inserted as a80v and balanced by .... in the other sequences.
prettyalign input.a2m >output.pre

See doc. section 3.2, 10.1.1

Convert the final alignment back to a HMM for use in searching databases.    This task is conducted iteratively in the target99 script, but the final model file is not retained.  The model file, foo.mod, is input for a variety of SAM operations, and also contains within it an accounting of the statistics involved in its creation.  These scripts adjust the weights assigned for inertions, deletions, and replacements consistent with the data in the alignment and rules provided by a "regularizer".  They do not attempt to improve the alignment to be consistent with the rules in the regularizer. 
The model building scripts available in sam3.4 are w0.5, w0.7,and w1.0.  Generally use w0.5 for divergent comparisons, and w1.0 for making family logos, and w0.7 for purposes in between.  The legends in the SAM documentation have it reversed from its correct sense.
Choose a model building script suitable for your specific purpose from those listed in doc. section 4.9, eg:

fw0.5 <input alignment> <output name>
for sam3.4:
w0.5 <input alignment> <output name>

Or use all default parameters:

modelfromalign <output name> -alignfile <input alignment>

Build ("train") a model.  Iterates between extracting weights from an alignment as above, and readjusting the gap positions in the alignment consistent with those weights. This program has many, many parameters, and is best left under the control of the target99 or target2k script. buildmodel <output name> -train <input.a2m>

See doc. section 3.1, section 9

Convert a model file to ASCII.  The most common reason would be to read the header and discover what regularizer had been used to make the mode.  Note: we renamed this script to avoid confusion with a script of the same name supplied with HMMER. It actually converts the normal binary model format to ASCII, or an ASCII formated file to binary.
Note: we changed its name to avoid confusion with a script by the sam name in the hmmer package.
hmmconvert-sam <new model name> -model_file foo.mod
Use an HMM model to search out the best matching sequences in a database.  The database is provided as a multi-fasta file.  The program is two slow to search through the full nr.  target99 uses this program several times, but uses a blast pre-filter to limit the size of the database searched to <10,000 sequences.  This would typically be used by a user to search the sequences found by target99 to trim off ones from the fringes.  It could alternatively be used to measure the closeness of sequences from one family to the model representing a related family, or to score a domain model within a larger alignment. In reverse search mode, it is used to search a small number of sequences against a library of models. Significant options are:
-sw 0 (whole model to whole sequence; the default)
-sw 1 (submodel to sequence matching)
-sw 2 (submodel to subsequence) - usual
-sw 3 (model to subsequence) - for domain searching
-calibrate 1 (Causes proper calibration of the E values)
-modellibrary <file1,file2...> (instead of -i, searches agains an array of models concatenated into one or more files).  A series of parameters can cause hmmscore to place sequences matching to within certain tolerances into an output file. See section 10.2.2
To calibrate a model, search a small library representative of negative matches.  My library for this purpose is /home/user/hardies/database/smhuman 
Use same -sw intended for scoring prospective positives.

hmmscore <output name> -calibrate 1 -i <input.mod> -db /home/user/hardies/database/smhuman -sw 2

The calibrated model is named <output name>.mlib.

Now use the .mlib file to search the sequence or sequences to be scored.  This could be an individual sequence, or a library of sequences of interest, perhaps enriched for positives by blast at low stringency, or another family suspected to be related.

hmmscore <output name> -modellibrary <input.mlib> -db <input seqs.fa> -sw 2

Hits are saved in <output name>.dist

Or to recover the alignment of the hits of E value < 0.1:

hmmscore <output name> -modellibrary <input.mlib> -db <inputseqs.fa> -sw 2 -select_align 4 -Emax 0.1

This puts alignment in <output name>.a2m
See doc. section 3.4, 10.2
 

To align sequences to a model.  This is typically used to make a revised alignment, given the model just made from a seed alignment.  Unlike hmmscore, sequences will not be rejected due to poor matching.  Can use -sw parameters. align2model <output name> -i <input.mod> -db <input seqs.fa>

See doc. section 3.2, 10.1

Calculate E value that two sequence families  match each other. Make alignment for each with with target99, and convert each to a HMM model.  Score each set with the other model with hmmscore, average the "reverse" scores and convert to E value.  See doc. section 4.2
To limit definition of a sequence family to the level of pfam (ie., exclude coalescence into superfamilies or convergent evolution due to common fold) Use target99 with -family option.  See doc. section 4.3
Predict secondary structure of a sequence aided by the existence of homologs.
  1. Choose a key sequence (key.seq), and a fasta file of homologs (seqs.fa)
  2. Create the alignment (seqs.a2m), and model (seqs.mod) using above tools.
  3. Perform secondary structure prediction on individual sequences by method of choice.
  4. Create a multifasta file corresponding to seqs.fa where each character is replaced by E, H, or L (sheet, helix, loop) (seqs.2d).
  5. predict_track <output name> -sw 2 -subtractnull 4  -i seqs.mod -a protein,EHL -db seqs.fa,seqs.2d -predictseq key.seq
  6. Note: If the individual secondary structures are predicted in a way that has more features (say turns), you should substitute the the appropriate alphabet for the EHL parameter.  (Doc section 7.1)
See doc section 10.7
Convet SAM HMM to HMMER HMM & vice versa sam2hmmer <output name> -i sam.mod
hmmer2sam <inputfilename> <outputfilename>

See doc. section 10.8.4

To remove duplicate sequences from an alignment.  Typically the seed sequences are also found again by target99, and so appear twice in the alignment. Significant options: -train seed.a2m (specifically removes all seeds seqs).
-db seqs.fa (removes duplicates from an unaligned fasta file).
uniqueseq <output name> -alignfile foo.a2m

Index into the documentation



Last update 6/4/2004