t_coffee
Installed version is 2.03.
-
Version 1.37 is also available by putting /usr/local/t_coffee137/bin first
in $PATH (i.e., export PATH=/usr/local/t_coffee137/bin:$PATH).
-
Version 1.37 is used for certain functions that are defective in version 2.03
(which is still in beta release).
-
Executables for 2.03 are in /usr/local/bin, which the default user profile
already searches (unless version 1.37 was put before /usr/local/bin
in $PATH). Version 2.03 can also be selected explicitly by putting
/usr/local/t_coffee203/bin first in $PATH.
-
Installation is as described on bioinf and bcf 4/8/2005
-
The SAP and fugue interfaces described in the documentation are not installed.
Source of program: http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html
Authors: Cedric Notredame, Desmond G. Higgins and Jaap Heringa
Sample input files provided with the installation are in
/usr/local/t_coffee203/examples/seq/.
Documentation:
-
documentation
-
Also found in /usr/local/t_coffee203/doc
-
paper
-
See also the paper cited at the web server.
-
web server (http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi)
Description: t_coffee performs a number of advanced functions
related to multiple sequence alignments. This document will
deal with protein sequences, but t_coffee is also used for alignment of
nucleic acid sequences directly, or by virtue of translating on the fly
to protein sequence.
-
The most rudimentary application is to try to improve on clustalw in
aligning multiple sequences. To do this, t_coffee will accept
a fasta file of unaligned sequences, employ clustalw and some fasta-like
methods to build a library of plausible alignments, and then apply
its own algorithm to estimate an optimal alignment.
-
t_coffee can also evaluate an alignment without realigning, highlight
discrepancies between two alignments, or find the best compromise among
two or more different pre-existing multiple alignments. Some
examples for doing these things in the locally installed t_coffee appear
below. The instructions in the documentation (above) are sufficiently
opaque that you would be well advised to try the web server (above) first.
If your needs exceed the capacity of the web server, a smaller job of the
same kind submitted there will reveal the command line syntax used by the
web server. Beware, however, that command line syntax appears to differ
among t_coffee versions.
-
Although the authors describe the improvement over clustalw as "dramatic",
quantitative comparison indicates a modest average improvement of only
a few percent (see http://www.drive5.com/muscle/ and papers cited therein;
that site describes MUSCLE, a faster program of comparable accuracy but
without the advanced features). However, there are examples where t_coffee
aligns an obvious motif that clustalw missed. For example, see the tutorial
on multiple alignment given at
(http://www.dina.kvl.dk/~sestoft/tmp/exercise-multiple.html).
In general, t_coffee (and clustalw) lose power once sequences have diverged
past the point where you can easily recognize motifs by inspection. As a
rule of thumb, the useful range for these programs ends at about
25% amino acid identity.
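
The 25% figure refers to pairwise percent identity. The following is a
minimal sketch of that calculation for two rows taken from an existing
alignment. It assumes gaps are written as "-" and counts identity only over
columns where both sequences have a residue; other conventions for the
denominator exist and give somewhat different numbers. The function name is
illustrative and is not part of t_coffee.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy rows (not real data): 4 identical residues out of 5 compared columns.
print(percent_identity("ACDE-F", "ACNEGF"))
```

Pairs that score near or below 25% by this measure are the ones to treat
with suspicion in an ab initio alignment.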
-
However, t_coffee has the useful feature that it can be guided by outside
information about the alignment of some of the more divergent members
of the family, and then focus its efforts on aligning each subgroup about
its prototype. The outside information is usually in the form of
a structural alignment of divergent members having structural data.
However, other forms of outside information can be accepted by the program.
For example a subjective alignment of particular motifs thought to be important
by the operator could be used to guide the overall alignment. Several
such sources of information can be incorporated at the same time.
Examples and commentary:
-
Ab initio alignment:
-
The file 1ajsA_ref3.pep,
provided with the sample data has 28 unaligned sequences homologous to
aspartate aminotransferase ranging from 14% to 60% identity. These
are in four groups (clades), each containing a member with a known crystal
structure.
-
A crude display of the pairwise % identities comes to the screen (actually
stderr) during the alignment in version 2.03, but not 1.37.
-
The screen output may be captured to a file either by adding the flag -quiet=filename
or by appending the shell redirection 2>filename to the command.
-
Other ways of viewing the distribution of percent identity are given below.
-
The simplest syntax to attempt an alignment is t_coffee 1ajsA_ref3.pep.
-
This is equivalent to t_coffee -infile=1ajsA_ref3.pep -align -output=clustalw
-run_name=1ajsA_ref3
-
These commands assume you have copied 1ajsA_ref3.pep from the t_coffee
sample directory to your own working directory.
-
The result is output of an alignment in clustalw format (although other
formats can be specified): 1ajsA_ref3.aln.
-
The same job submitted to version 1.37 gives a different result.
The version 2.03 alignment was plainly more accurate.
-
Somewhat confusingly, adding the flag -clean_aln=1 to the
1.37 job caused the alignment to become much more like the 2.03 alignment,
whereas adding -clean_aln=1 to the 2.03 job caused no change.
-
-clean_aln=1 is supposed to turn on a secondary realignment of sequences
that are rated as poorly aligned after the initial alignment.
-
Both versions 1.37 and 2.03 list -clean_aln as 0 (off) by default.
-
In version 1.37, I have observed the secondary realignment to occur by
default under circumstances where the input alignment was not supposed
to be altered at all, thus completely confusing the result. Hence, I recommend
always running with -clean_aln=0 or 1 set explicitly.
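
As a way of never forgetting the explicit -clean_aln setting, the command
can be assembled by a small helper. This is only a sketch: the function name
is hypothetical, it uses only the flags quoted above, and it builds the
argument list without running anything.

```python
from pathlib import Path


def alignment_command(infile, clean_aln=0, output="clustalw"):
    """Build the explicit form of 't_coffee <infile>' as an argument list.

    -clean_aln is always written out, as recommended above.
    """
    if clean_aln not in (0, 1):
        raise ValueError("clean_aln must be 0 or 1")
    run_name = Path(infile).stem  # 1ajsA_ref3.pep -> 1ajsA_ref3
    return [
        "t_coffee",
        f"-infile={infile}",
        "-align",
        f"-output={output}",
        f"-run_name={run_name}",
        f"-clean_aln={clean_aln}",
    ]


print(" ".join(alignment_command("1ajsA_ref3.pep")))
```

The resulting list can be handed to subprocess.run, with stderr directed to
a file if the screen output is to be kept.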
-
One of the output files is 1ajsA_ref3.dnd. View this file as a phylogram
with treeview.
It looks like this.
-
Although a treeview version for linux is distributed, we have not yet been
able to make it work on bcf/bioinf. The MS windows version installs
easily, so the easiest thing to do is to scp the .dnd file to your Windows
machine and run Treeview on it there.
-
The scale bar indicates the branch length corresponding to 10% divergence
(uncorrected for multiple hits).
-
You can see that there are basically 3 groups of sequences included that
are likely to be alignable within group but not between groups by the methods
thus far employed.
-
A more exact representation of the pairwise similarity can be found by
converting the fasta file to a blast library and then searching it with
representative sequences.
-
formatdb -i 1ajsA_ref3.pep -pT -oT
-
blastpgp -i 2dkb.fa -d 1ajsA_ref3.pep -j6 -o 2dkb.blastout
-
This reveals that even among sequences separated by 50-60% divergence,
the last 100 residues may be difficult to align.
-
The alignment can be color coded according to confidence:
-
t_coffee 1ajsA_ref3.pep -output=score_html,clustalw
-
produces: 1ajsA_ref3.score_html
-
Note from the color code that t_coffee doesn't put much confidence in this
alignment.
-
This is a measure of consistency among several methods to align each segment.
Poor scores indicate that more than one equally good possible alignment
has been found. In this case, the presence of poor alignment among
different groups can be expected to generally degrade the scores across
the alignment.
-
To see how a reliable alignment scores, carve out just the sequences closely
related to 1ajsA and align: t_coffee group1.pep -evaluate_mode=t_coffee_non_extended
-output=score_html producing this.
-
The -evaluate_mode chosen is one of several available. This
one seemed the least conservative, and therefore most appropriate for an
alignment that looks very solid.
-
Repeating the operation with one of the distant sequences added produces
this.
-
In this case, the color coding clearly flags the single unreliably aligned
sequence.
-
The few spots in the new sequence that rate in the "good" range are actually
also completely wrong, as judged by the available structural alignment
(see below).
-
t_coffee as a simple alignment format converter:
-
Inclusion of the -convert parameter cancels the alignment and just
converts input to the specified output format.
-
Example: to convert a clustalw .aln file to an .msf formatted file: t_coffee
file.aln -convert -output=msf
-
Input format is autodetected.
-
In the above example, the .msf file inherits the root filename of the input
file. To specify a different filename for the output file: t_coffee
file.aln -convert -run_name=newname -output=msf produces newname.msf.
-
The msf file is successfully read by PAUP and converted to a nexus file
by the paup command (typed to the paup command interpreter): tonexus
fromfile=file.msf tofile=file.nex format=GCG; This series
(after cleaning up the SAM prettyalign header with vi) is the only command
line conversion in the bioinf/bcf system that I know of that actually works
to provide a path from SAM to PAUP.
-
According to the manual: -output=gcg is the same as -output=msf;
-output=clustalw produces a .aln file; -output=fasta produces a fasta
file with gaps in place; -output=pir produces a multisequence pir file
with gaps in place.
-
Structure alignment.
-
4 of the sequences (the ones starting with numbers) have known 3D structures.
-
A separate fasta file was created containing just those 4 sequences.
-
The structure alignment was done at the T-coffee web site to illustrate
how to use the web site to learn how to set up a job, and because we have
not yet installed SAP and the client for fugue that would allow the local
t_coffee implementation to call on these resources.
-
Returned items:
-
The command line created by their web server to run the job in their system.
This gives some idea of the complexity of the command that would have to
be issued in the local system.
-
There are actually several commands separated by semicolons. The
preparatory commands to set the working directory and some environment
variables are not relevant. Neither is the final command to clean
up the working directory. Also the long paths to the working directories
used by the web server and the elaborate name assigned to the input file
are not relevant to how one would set up a local implementation.
Also, the > and 2> phrases capture output to files, which is not
necessary in a locally submitted job. The clustalw command spawned
by the program is probably set up automatically by t_coffee.
-
All the parameters specified after the t_coffee command, on the other hand,
would have to be understood and used correctly to create a local job.
In particular, there are calls to both structure alignment programs SAP
and fugue, and to plain sequence alignment programs. The latter are
presumably necessary to provide some alignment (even if probably wrong)
in surface loops where structure alignment is unlikely to be informative.
However, it is unclear how the poor alignment information has been weighted
with the structure information where the structures actually align.
-
There is also a log file that contains more information about how the
various parameters have been set (including by default).
-
The alignment is returned in several formats.
-
The clustalw aln format would be saved to act as outside information in a
subsequent larger job.
-
The score_html file reveals the confidence estimated by t_coffee in its
alignment.
-
The associated ESPript pages give various representations in alignment with
the secondary structure elements of the proteins. This display of secondary
structure associated with the scored alignment illustrates that some
secondary structural elements that surely must be aligned with certainty
are marked as lower confidence.
The inclusion of ordinary sequence alignment methods is probably obscuring
the quality of the structure alignment in these areas. At this point,
you should go down to the SGI in Dr. Demeler's office and run an Insight
II homology modeling job on these same 4 structures. That will give
you a much more visual picture of which residues are structurally aligned
and which are not. Something more like the picture you will get from
insightII appears if we omit all methods except SAP and fugue from the
structure alignment job. Result.
-
Scoring agreement between two different alignments.
-
Inspection reveals that the 4 sequences in the structure alignment made
above are generally not aligned consistently with the ab initio alignment.
-
The color coding scheme can be used to score one alignment with respect
to the other.
-
First, one alignment must be converted into a library that will be used
to score the other alignment.
-
Version 2.03 is defective in both library making and scoring, so for these
operations version 1.37 is used.
-
Implement version 1.37 by export PATH=/usr/local/t_coffee137/bin:$PATH
-
Carry out commands requiring 1.37 as described below.
-
Libraries produced by 1.37 have a formatting defect (a missing exclamation
point) on the penultimate line. Use vi to make the last two lines of the
library look exactly as follows:
-
The syntax for making the library is t_coffee -in=str.aln -convert -weight=10000
-out_lib=str.tc_lib
-
Contrary to the documentation, the -in specifier is necessary. Otherwise
the program contaminates the library with additional wrong alignment information
that it generates with weaker methods.
-
The -weight=10000 specifies that all positions in the structural
alignment will be considered as correct for the purpose of scoring the
other alignment. This maneuver hijacks the weighting system that
is usually used for the quality of the match to now just mean consistency
with the structural alignment.
-
The command t_coffee -infile=A1ajsA_ref3.aln -score -in=Lstr.tc_lib
-output=score_html -run_name=evstr -clean_aln=0 -evaluate_mode=t_coffee_non_extended
produces: evstr.score_html
-
The -score flag is supposed to tell the program to just score and
not align, but this is a case where -clean_aln=0 is necessary to
cut off secondary realignment that interferes with the intent of the calculation.
-
The -evaluate_mode=t_coffee_non_extended flag is necessary; otherwise,
the scores are averaged over multiple residues, which obscures
much of the detail.
-
The A and L prefixes on the input alignment and the library are supposed
to supplement t_coffee's autodetection of what kind of files these are.
It is unclear when they are needed. You should generally avoid filenames
beginning with capital letters when using t_coffee because of the obvious
potential for confusion with prefixes.
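
A quick check for filenames that invite this confusion can be sketched as
follows. The helper name and the example filenames are made up for
illustration; the only rule applied is the one stated above (avoid names
that begin with a capital letter).

```python
import os


def names_starting_with_capital(filenames):
    """Return the filenames whose base name begins with an ASCII capital letter."""
    flagged = []
    for name in filenames:
        base = os.path.basename(name)
        if base[:1].isascii() and base[:1].isupper():
            flagged.append(name)
    return flagged


# Hypothetical filenames, for illustration only.
print(names_starting_with_capital(
    ["Group1.aln", "str.tc_lib", "work/Lib2.tc_lib", "1ajsA_ref3.aln"]
))
```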
-
This operation clearly brought out that there are two (red) blocks where
the ab initio alignment got the structured sequences correct. There
are also places (yellow) where two of the sequences were aligned consistent
with the structural alignment, and places (orange) where 3 of the sequences
were consistently aligned. In principle, places where SAP gave poor
support in the structural alignment could be correct in the ab initio alignment.
However, in practice, places where structure alignment fails are probably
not alignable by any method.
-
This display is trying to represent the quality of 6 different pairwise
comparisons at once, and so obscures what the consistency is between any
2 particular sequences. To ask a more focused question "was the alignment
of 2gsaA and 2dkb in 1ajsA_ref3 consistent with their structural alignment?"
do the following.
-
t_coffee str.aln -convert -output=fasta -run_name=2strs to make
a file more easily edited by vi to remove the other structured sequences.
-
After removing the two other sequences, make the library 2strs.tc_lib as
before and use it to score 1ajsA_ref3.aln producing this.
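
Removing the unwanted sequences does not have to be done by hand in vi. The
sketch below keeps only the named records of a fasta file. It assumes the
sequence name is the first word after ">" on each header line; the function
name is hypothetical and the residues in the example are placeholders, not
the real sequences.

```python
def keep_records(fasta_text, wanted):
    """Return fasta text containing only records whose name is in `wanted`."""
    wanted = set(wanted)
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            # The record name is taken to be the first word of the header.
            keep = bool(fields) and fields[0] in wanted
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Placeholder residues; only the sequence names come from the example above.
toy = ">1ajsA\nAC-D\n>2gsaA\nA-CD\n>2dkb\nACCD\n"
print(keep_records(toy, ["2gsaA", "2dkb"]), end="")
```

Columns that contain only gaps after the other sequences are removed are
left in place by this sketch.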
-
Similarly, converting the group1 alignment to a library and scoring 1ajsA_ref3.aln
with it reveals only a few spots where the attempt to align across the
deep splits has altered the alignment within this group. Result.
-
Log out and log back in to reset the t_coffee version to 2.03.
-
Hopefully an update of version 2 will appear that eliminates the need to
fall back to version 1.37 for this operation.
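
An alternative to changing $PATH for the whole login session is to put the
1.37 directory first only for the individual commands that need it. The
following sketch builds such an environment; the helper name is
hypothetical, and ":" is assumed as the path separator (as on the Linux
hosts described here).

```python
def env_with_version_first(bin_dir, base_env):
    """Copy an environment mapping, putting bin_dir first in PATH."""
    env = dict(base_env)  # leave the caller's environment untouched
    old_path = env.get("PATH", "")
    env["PATH"] = f"{bin_dir}:{old_path}" if old_path else bin_dir
    return env


env137 = env_with_version_first(
    "/usr/local/t_coffee137/bin", {"PATH": "/usr/local/bin:/usr/bin"}
)
print(env137["PATH"])
```

Passing os.environ as base_env and the result as the env argument of
subprocess.run would run one command under 1.37 while the login shell stays
on 2.03.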
-
Fusing two alignments with t_coffee
-
Consider the task of adding the 3 other structured sequences to the group1
alignment. In this case we already have all the information necessary
to determine the result. All sequences in group1 should retain exactly
their relationship to 1ajsA as they do in group1.aln, and the other 3 structured
sequences should retain exactly their relationship to 1ajsA as they do
in str.aln.
-
Libraries with -weight=10000 are made for each of the alignments.
-
Then fuse them by t_coffee group1.aln -in=Astr.aln -in=Lgroup1.tc_lib
-in=str.tc_lib -clean_aln=0 -output=clustalw,score_html -run_name=group11
-evaluate_mode=t_coffee_non_extended
-
By scoring the new alignment with the two respective libraries, you will
see that this did the trick.
-
You can then imagine adding the other groups one at a time, or trying to
fuse them all in one operation.
-
One caveat is that there is other alignment information being generated
in this operation. The weight of 10000 on the information provided
in the libraries is supposed to overpower any conflicting information generated.
However, if there are enough sequences creating (potentially wrong) pairwise
alignment information, then the weights may have to be increased to maintain
their effects. The documentation suggests the number of sequences
in the total alignment squared times 1000. It is unclear if there
is some upper limit before the program misinterprets the number.
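
The suggested weight is simple arithmetic: the number of sequences squared,
times 1000. As a small sketch (the function name is illustrative):

```python
def suggested_weight(n_sequences: int) -> int:
    """Weight suggested above: (number of sequences)^2 * 1000."""
    if n_sequences < 1:
        raise ValueError("need at least one sequence")
    return n_sequences ** 2 * 1000


# The 28 sequences of 1ajsA_ref3.pep would call for 28 * 28 * 1000.
print(suggested_weight(28))  # 784000
```

For the full 28-sequence set this is already well above the 10000 used in
the examples here.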
-
For large families, this structure with only single representatives outside
the immediate clade added for context is a useful stopping point.
-
One could well imagine a simpler program to achieve this kind of fusion,
since tree making, substitution matrices, and dynamic programming are all
irrelevant to the sought-after result.
-
Joint structure/sequence alignment.
-
Joint sequence/structure alignment is different in spirit from the fusion
described above. The idea is that the structure information should
prevail where it is strong, but that sequence conservation should outweigh
structure when sequence conservation gives a stronger signal. This
is an extension of t_coffee's original purpose, which was to allow other
methods to overrule clustalw but only where those methods gave a stronger
signal. t_coffee always tries to do this. In the case of the
fusion described above, we gave it some strongly and uniformly weighted
information to overpower all other information. For a true joint
alignment, we should have given it the variably weighted library created
by SAP and let the algorithm balance those weights against those created
during its ab initio calculation.
-
Since we haven't installed SAP yet, we have no way to recover the variably
weighted SAP library.
-
The variably weighted SAP library could be simulated by cutting the structure
alignment apart into several well aligned blocks, and trimming the poorly
scored segments of structure alignment away. These blocks could be
each made to their highly weighted libraries and added all at once to an
ab initio alignment. Hence they would force the issue in the regions
that they covered, but leave t_coffee to its own devices in other regions.
This is also how one would constrain t_coffee to align certain motifs that
had been recognized in the family, even in the absence of structural information.
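
Cutting an alignment into blocks amounts to taking the same column range
from every row. A minimal sketch, assuming the alignment has been read into
a mapping of name to gapped row and that the block boundaries (1-based,
inclusive) have been chosen by eye from the scored display; the function
name and the toy rows are illustrative only:

```python
def extract_blocks(alignment, blocks):
    """Cut column ranges (1-based, inclusive) out of an alignment.

    alignment: dict mapping sequence name -> gapped row, all the same length.
    blocks:    list of (start, end) column pairs.
    Returns one small alignment (dict) per block.
    """
    lengths = {len(row) for row in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned rows must have the same length")
    width = lengths.pop()
    pieces = []
    for start, end in blocks:
        if not (1 <= start <= end <= width):
            raise ValueError(f"block {start}-{end} is outside columns 1-{width}")
        pieces.append({name: row[start - 1:end] for name, row in alignment.items()})
    return pieces


# Toy rows (not real data): keep columns 1-3 and 7-9, drop the middle.
toy = {"1ajsA": "ACDEFGHIK", "2dkb": "AC-EFG-IK"}
for piece in extract_blocks(toy, [(1, 3), (7, 9)]):
    print(piece)
```

Each piece would then be written out as its own alignment file and made
into a highly weighted library in the way already described.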
-
One could alter the average amount of weight given a particular variably
weighted library as follows:
-
Align the group in isolation and output its variably weighted library using
-out_lib.
-
Write a small program to proportionately increase the weight in column
3 of the library file to taste.
-
Add back the upweighted library to the global alignment the same way the
libraries are added above.
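
The "small program" in the second step could look something like the
following sketch. It relies on a loud assumption about the library file:
that the lines carrying weights are exactly those that begin with three
whitespace-separated integers, the third being the weight (column 3, as
stated above). All other lines are passed through unchanged. Check this
against a real library before trusting it; the toy lines in the example are
not real library content.

```python
import re

# Assumption: a weighted line starts with three integers; the third is the weight.
PAIR_LINE = re.compile(r"^(\s*\d+\s+\d+\s+)(\d+)(.*)$")


def scale_library_weights(lines, factor):
    """Multiply the column-3 weight on each matching line by `factor`.

    `lines` must not carry trailing newlines (use str.splitlines()).
    """
    scaled = []
    for line in lines:
        match = PAIR_LINE.match(line)
        if match:
            new_weight = round(int(match.group(2)) * factor)
            line = f"{match.group(1)}{new_weight}{match.group(3)}"
        scaled.append(line)
    return scaled


toy_lines = ["# toy header", "  5 7 40", "12 14 100 extra", "seqA 4 ACDE"]
print(scale_library_weights(toy_lines, 2.5))
```

Reading the library with read().splitlines(), scaling, and writing the
lines back joined by newlines gives the upweighted library for the third
step.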
-
Of course, you can throw your entire set of sequences into the web server
and hope for the best.
-
It will automatically look up those sequences with pdb-like names and do
a structural alignment in conjunction with its global alignment.
-
The scoring methods described above will give you the power to ask if the
resulting alignment meets your expectation with respect to alignment within
well related groups, and with respect to conserving blocks well defined
by the pure structure alignment.
-
If the default parameters for balancing weights at the web site do not
produce the result you want, then there is limited flexibility to make
adjustments there. You can add or subtract from their list of methods
applied. And you can carve up your sequences to subsets designed
to avoid an overburden of highly divergent pairwise comparisons in each
individual job.
Seaview integration:
-
The alignment editor named seaview, implemented on bioinf/bcf, has the
option to select subsections of an alignment (both by coordinates and by
selection of sequences) and spawn a realignment. By default, clustalw
is used. To have t_coffee conduct the alignment instead:
-
Make sure the working directory is first in your path, i.e., export PATH=.:$PATH
-
Copy /usr/local/bin/seaview_align.sh-t_coffee to your directory and rename
it seaview_align.sh
-
Alternatively put your root or your own bin directory first in path, and
put seaview_align.sh there.
-
To revert to using clustalw instead of t-coffee, rename or delete the seaview_align.sh
file.
-
To exercise greater control over the t-coffee parameters, edit the file
seaview_align.sh. On the line t_coffee $args, follow with additional
parameters you wish to use (eg. constraint by a library generated from
a structural alignment).
-
Seaview piecewise alignment might be used to cause t_coffee to force alignment
of some motif into an otherwise acceptable alignment, or to repair deviation
from the structural alignment in an otherwise acceptable alignment.
-
This strategy might allow application of t_coffee to optimize very large
alignments that it cannot handle in a single session due to memory or
execution time limitations.
Accompanying utilities: It is unclear which of these are called by other parts
of the system, whether they are correctly installed, or how to use them
if they are stand-alone utilities.
-
blast_aln2fasta_aln.pl
-
Converts a blast #6 format output file to clustalw format.
-
Could be used to influence an alignment to adhere to Psi-Blast results,
where available.
-
The resulting alignment will only have aligned segments of the sequences
represented.
-
msf_aln2fasta_aln.pl
-
fasta_aln2fasta_aln_unique_name.pl
-
This script reads an alignment in fasta format and modifies names so that
there are no duplicate names.
-
If two sequences have the same name, the second one is renamed name_1, and
so on.
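
The renaming rule can be sketched as below. This illustrates the behavior
as described here and is not the script's actual code; in particular, it
does not guard against a generated name such as name_1 colliding with a
name already present in the input.

```python
def unique_names(names):
    """Rename repeats: the second 'x' becomes 'x_1', the third 'x_2', and so on."""
    seen = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return result


print(unique_names(["a", "b", "a", "a"]))  # ['a', 'b', 'a_1', 'a_2']
```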