Molecular evolutionary theory, as related to protein family analysis and
choice of sites for in vitro mutagenesis.
Conserved residues:
Most
students understand intuitively that functionally important residues may
not be able to change much during evolution, and hence residues that don't
change may be good targets for in vitro mutagenesis. In a
finer sense, given a variety of residues implicated by structural data
to interact with the substrate, the ones that do not change in the family
would seem to be essential in their function, while the ones that do change
might be either irrelevant or may reflect functional distinctions among
the family members.
As a control for this concept, pseudogenes have been investigated.
Pseudogenes have lost function, and their residues demonstrate the evolutionary
behavior to be expected of residues that exhibit no conservation.
Pseudogenes evolve at the neutral rate, which is adequate to saturate
(randomize) the identity of residues well within the divergence time
separating humans and other mammalian orders (such as rodents). Comparison
of the divergence rate of real proteins with pseudogenes reveals
the fraction of possible mutations that have been excluded by natural selection.
This measure of conservation varies substantially among different proteins.
Examples are:
| Protein | % variations excluded by natural selection |
| --- | --- |
| RNAse | 50-70 |
| cytochrome c | 90 |
| Pyruvate Kinase | 97 |
| Histone | 99.7 |
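The comparison described above reduces to simple arithmetic. As a minimal sketch, assuming the protein and pseudogene divergence rates are expressed in the same units, the fraction of mutations excluded by selection is one minus the ratio of the protein rate to the neutral rate. The function name `fraction_excluded` and the rate values below are invented for illustration and are not measurements from this document.

```python
def fraction_excluded(protein_rate: float, neutral_rate: float) -> float:
    """Fraction of possible changes removed by natural selection.

    Both rates must be in the same units (for example, changes per site
    per billion years). The neutral rate is taken from pseudogenes.
    """
    if neutral_rate <= 0:
        raise ValueError("neutral_rate must be positive")
    if protein_rate < 0:
        raise ValueError("protein_rate must be non-negative")
    return 1.0 - protein_rate / neutral_rate


# Hypothetical rates chosen only to illustrate the calculation.
neutral = 4.0
for name, rate in [("protein A", 2.0), ("protein B", 0.4)]:
    print(f"{name}: {100 * fraction_excluded(rate, neutral):.0f}% excluded")
```

A protein diverging at half the neutral rate has had about half of its possible changes excluded; a protein diverging at a tenth of the neutral rate has had about 90% excluded.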
Generally
speaking, the fraction of changes found to be functionally insignificant
after in vitro mutagenesis is much higher than the fraction judged to be
acceptable by natural selection. Natural selection is believed
to eliminate alleles with a selective coefficient as low as -0.001.
Biochemical assays are not nearly that sensitive. This discrepancy
between the sensitivities of natural selection and biochemical assay often
causes highly conserved residues to show puzzlingly little effect upon
in vitro mutagenesis.
It is also apparent that many of the residues that are observed to change
are nonetheless under considerable natural selection. Of course,
it is well recognized that chemically conservative residue changes are
better tolerated than non-conservative changes. The nature of the
genetic code is such that most one-nucleotide changes are chemically conservative.
The numbers tabulated above are for one-nucleotide substitutions, and hence
are weighted towards the chemically conservative class of amino acid replacements.
If it were simply true that residues exchanged freely within an acceptable
set of replacements, but were prohibited by natural selection from changing
otherwise, this kind of evolution too would saturate well within the divergence
time separating humans and mice (~85 million years). This is clearly
not true. Observed increases in divergence occur over a much broader
range of divergence time, and can be used to construct evolutionary
trees spanning up to 2 to 3 billion years. This implies that
the residues showing partial conservation are prohibited from changing
at all in most time periods and then allowed to change in others.
Among the factors that might change from time to time are the identities
of the surrounding residues. This implies a high degree of non-additivity
in the impact of residue replacements on the fitness judged by natural
selection. Most of this implied complexity is apparently at the more
subtle level of functional revision, since the larger functional differences
measured after in vitro mutagenesis tend to be additive. The
functional effects restraining the lesser conserved residues are usually
at the subtle level sensed by natural selection but not biochemical assay,
and the intuitive sense that lesser conserved residues will not give much
of a functional difference during in vitro mutagenesis usually holds
true. But there can be exceptions where mutating a residue to a variant
found in other homologues produces a serious loss of function.
Evolutionary Tree:
An advantage of estimating the evolutionary tree for a protein family
under examination is that it can help clarify the difference between orthologues
and paralogues. The existence of paralogues within the same
genome implies functional specialization, and sets up in vitro mutagenesis
experiments for swapping residues between the paralogous proteins in order
to map how the functional distinctions are brought about. Additionally
it helps create objective subdivisions of the protein family called clades.
It may substantially clarify an in vitro mutagenesis plan to distinguish
between two residues that interchange freely at a position from two residues
that are each conserved within respective clades. The latter is a
promising swap that might reveal a functional distinction between the two
residues, whereas the former case is not.
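The distinction drawn above can be checked mechanically once an alignment and a set of clade assignments are in hand. The following is a minimal sketch, assuming the alignment is supplied as a dictionary of equal-length aligned strings and the clades as lists of sequence names. The function names, the toy sequences, and the three category labels are all invented here for illustration.

```python
from typing import Dict, List


def classify_column(residues_by_clade: Dict[str, List[str]]) -> str:
    """Classify one alignment column given its residues grouped by clade.

    "invariant"      - a single residue across all clades
    "clade-specific" - each clade has a single residue, but clades differ
    "variable"       - at least one clade contains more than one residue
    """
    sets = [set(residues) for residues in residues_by_clade.values()]
    if any(len(s) > 1 for s in sets):
        return "variable"
    if len(set.union(*sets)) == 1:
        return "invariant"
    return "clade-specific"


def classify_alignment(alignment: Dict[str, str],
                       clades: Dict[str, List[str]]) -> List[str]:
    """Return one label per alignment position."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    labels = []
    for i in range(length):
        column = {clade: [alignment[name][i] for name in members]
                  for clade, members in clades.items()}
        labels.append(classify_column(column))
    return labels


# Toy alignment with two hypothetical clades.
toy_alignment = {"a1": "GDL", "a2": "GDI", "b1": "GEL", "b2": "GEI"}
toy_clades = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
print(classify_alignment(toy_alignment, toy_clades))
```

Positions labeled clade-specific correspond to the promising swaps; positions labeled variable correspond to residues that interchange freely. Gap characters are treated like any other character in this sketch, which a real analysis would probably want to handle separately.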
There
are many methods for calculating trees from sequence alignments.
For protein alignments, the best starting point is a neighbor joining tree.
This is a relatively fast and reliable method that can be accessed at the
bioinformatics facility, and at various places on the web. It permits
a variety of corrections for saturation. The confidence in the divisions
of the tree should be evaluated with a bootstrap analysis. The bootstrap
analysis consists of resampling the alignment positions with replacement
many times, recalculating the tree for each replicate, and then reporting
for each clade the fraction of replicates in which all of its members were
again grouped together.
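The bootstrap procedure can be sketched in a few lines. To keep the example short, the sketch below does not implement neighbor joining: it substitutes simple average-linkage (UPGMA) clustering on uncorrected fractional differences, which is a cruder method. The resampling step, drawing alignment columns with replacement, is the part being illustrated. All function names and the toy sequences are invented for this example.

```python
import random
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma_clades(seqs: Dict[str, str]) -> Set[FrozenSet[str]]:
    """Return every multi-member cluster formed by average-linkage clustering."""
    dist = {frozenset([a, b]): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}

    def cluster_distance(c1: FrozenSet[str], c2: FrozenSet[str]) -> float:
        # Average of all pairwise distances between members of two clusters.
        total = sum(dist[frozenset([a, b])] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters: List[FrozenSet[str]] = [frozenset([name]) for name in seqs]
    clades: Set[FrozenSet[str]] = set()
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merged = c1 | c2
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        clades.add(merged)
    return clades


def bootstrap_support(seqs: Dict[str, str], clade: Iterable[str],
                      replicates: int = 200, seed: int = 0) -> float:
    """Fraction of bootstrap replicates in which `clade` is recovered."""
    rng = random.Random(seed)
    names = list(seqs)
    length = len(seqs[names[0]])
    target = frozenset(clade)
    hits = 0
    for _ in range(replicates):
        # Draw alignment columns with replacement, same length as the original.
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {n: "".join(seqs[n][i] for i in cols) for n in names}
        if target in upgma_clades(resampled):
            hits += 1
    return hits / replicates


# Toy alignment: two hypothetical clades of two sequences each.
seqs = {
    "a1": "MKVLAAGIDE",
    "a2": "MKVLAAGIDQ",
    "b1": "MRILSSGVEE",
    "b2": "MRILSSGVEQ",
}
print(bootstrap_support(seqs, ["a1", "a2"]))
```

The value returned is the support for the named clade. In real use, the clustering step would be replaced by the tree method actually being evaluated, with an appropriate correction for saturation.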
Alignment:
Any statement about conservation of individual residues, whether intuitive
or through a formal phylogenetic analysis, relies on the sequences being
meaningfully aligned. This becomes increasingly difficult across
more divergent splits in the tree. A range of programs used for this
purpose are discussed in this alignment
help document. When alignments cross divergence that exceeds
the point where the validity is obvious to the eye, a more elaborate method
of verification is required. If there are 3D structures in the two
divergent clades, then one can ask if the sequence alignment recapitulates
the structure alignment. If not, there are alignment methods that
directly incorporate the structure alignment information. In the
absence of 3D structure information, one can predict secondary structure
averaged over the individual clades and ask if the sequence alignment sensibly
aligns the predicted secondary structure elements. A help
document for secondary structure prediction is available.
Neutral Drift versus Selection:
There
is a long standing argument in molecular evolution about the nature of
the changes that do occur. Neutral theory holds that most of them
occur simply because they do no harm. Selection theory holds that
most changes occur due to functional optimization driven by natural selection.
In order to explain the large number of changes over time, it is postulated
that the exact requirements on the function of the protein keep shifting.
Whichever theory is correct, it is clear that at the level of sensitivity
of biochemical assays most of the naturally occurring changes are effectively
neutral. This should influence your thinking about targets for in
vitro mutagenesis. For example, suppose you observe an alpha helix
with a proline in it. This is unusual, so perhaps you would propose
to do in vitro mutagenesis to discover why this alpha helix needs
to be distorted by a proline. The answer may be "because it does
no harm". The case for investigating this residue would be stronger
if you observed it to be conserved, at least within a clade on the tree.
Glossary:
-
Allele - a variation of a gene.
-
Change - in sequence comparison, an individual event in time when
a base or residue in a sequence was replaced by a different base or residue.
When referring to nucleic acid sequence, changes are called "substitutions".
When referring to amino acid sequence, changes are called "replacements".
Some authors refer to either kind of changes as substitutions. The
number of changes must be estimated from the number of differences by a
statistical model that accounts for multiple changes at a site appearing
as only one difference, or for a chain of changes leading back to the same
character and appearing to be no difference. There are algorithms
in use to make such corrections.
-
Clade, cladistic - all descendants of a common ancestor located
on a tree, including all ancestral states descended from the given common
ancestor. A clade is a formal and objective version of a family.
In principle, a hierarchical set of clades can be defined with one per
branch point on the tree. In practice, a minimum number of clades
is usually chosen for discussion to capture the major divisions in the
tree. Sequences are said to be in separate clades if neither of the
two respective common ancestors is descended from the other.
-
Conservative selection, conservation, conserved - the state of a
sequence being maintained because loss of the gene would reduce the fitness
of the organism and such organisms would be eliminated by natural selection.
The term is applied to whole genes, but also to subsequences and even individual
bases or residues within a sequence.
-
Convergent evolution - for both organisms and sequence: a process
by which similarity is achieved independently without a common ancestor
due to selection for a common function.
-
Difference - a position in a sequence comparison where two sequences
are different. For the number of differences to be used as a measure
of divergence time, 1) the sequences must be meaningfully aligned, and 2) the
number of differences must be corrected to the number of changes.
-
Divergence - the process of accumulating differences during descent
from a common ancestral sequence. The term is applied both to sequences
and to organisms. For estimating divergence time, the number of differences
must be converted to the number of changes.
-
Divergence rate - the rate per unit time of accumulating changes.
-
Divergence time - the time elapsed since the existence of a common
ancestor. The term is applied both to sequences and to the organisms
carrying them. Note, however, that the divergence time for particular
sequences may not correspond to the divergence time for the organisms carrying
them. See orthologous, paralogous, and horizontal transfer.
Divergence times for some organisms have been estimated from the fossil
record. Other divergence time estimates for organisms are based on
estimating the divergence times for one or more of their orthologous genes.
-
Evolutionary tree - the time and order of speciation events leading
to a collection of species of organisms. More commonly, a model for
the true history of events created from some combination of the fossil
record and observations on the divergence of specific genes. For
genes, a model for the history of a collection of orthologues and paralogues
estimating the timing and order of speciations, and gene duplications,
and from which an estimate of the sequence of various ancestral genes might
be inferred. Due to the difficulty in estimating the full
range of parameters associated with an evolutionary tree, there are several
less committal versions of a tree that are in common use:
-
Evolutionary tree, unrooted - a diagram showing the amount of divergence
and order of speciations (and duplications) of a set of organisms or of
a set of gene sequences. The unrooted tree does not commit to the
divergence rate having been the same on all branches, and does not identify
a root.
-
Cladogram - a rooted or unrooted tree drawn with attention to branch
order, but not with respect to the divergence time.
-
Dendrogram - a diagram showing clustering of sequences according
to some criterion not necessarily well correlated with the true evolutionary
tree. A dendrogram is intended to summarize some data, not to comprise
the final estimate of the tree.
-
Family - For sequences: a collection of homologous protein sequences
related within some limit of similarity. Families are usually defined
for domains rather than complete polypeptides to avoid confusion from interdomain
recombination. The limit of similarity is imposed by practical considerations
related to avoiding the inclusion of false members (non homologues, or
paralogues, or orthologues conceived to merit a different family name)
in the family. The limit of similarity and hence the family boundary
may vary from one investigator to another, hence families are not true
objective constructs. Most typically, a family is defined as a collection
of sequences with enough similarity to be detected by BLAST and convincingly
aligned by ClustalW. Most typically, the core family defined in this
way is used to create a PSSM or HMM model and search for and align more
divergent homologues. These more divergent members are presented as possible
family members, creating a fuzzy boundary to the family. Homologues
that are too diverged to be convincingly aligned by the method in use by
the investigator may be organized into one or more separate families that
are said to all be related.
-
Fitness - For an organism, the overall effectiveness of surviving
and producing offspring due to all inherited traits. For a
gene, the contribution to overall fitness relative to a wild type allele.
Over time, the relative frequency of the allele in a population will rise
or fall depending on its fitness. The difference between the fitness
of the allele in question and the wild type allele is called the selective
coefficient. It is believed that a selective coefficient as low as
-0.001 may be enough to eventually cause the less fit allele to disappear
from the population.
-
Gap - an unoccupied position in a sequence added by an alignment
algorithm in order to maximize similarity. Gaps must be assigned
a gap penalty during sequence alignment. Otherwise it is always possible
to align to 100% identity by adding enough gaps. Gap penalties most simply
consist of a fixed penalty to place a new gap and an incremental penalty
to enlarge it. Gaps at the ends of sequence may receive special consideration.
-
Homology, homologues, homologous - the state of having descended
from a common ancestor. Sequences are generally stated to be homologous
if they share similarity too great to be explained by another process such
as random chance or convergent evolution. Strictly, homology is
either true or false, and does not imply that the homologous sequences
must have any particular degree of similarity. In common (and incorrect)
use, x percent homology means homologous and having x percent similarity.
Sequences should not be concluded to be non homologous from a lack of detectable
similarity. Rather it should be said that "homology was not detected".
-
Horizontal transfer - a transfer of genes from one organism to
another by a means other than the usual replicative process for the organism.
Examples are movement of retroviral genes from the germline of one species
of mammals to another through an infective process, and transfer of genes
between bacterial species by conjugation. Horizontal transfer is
generally inferred when the divergence time of particular genes between
two organisms is clearly less than the divergence time of the organisms.
-
Identity, % identity - the percentage of bases or residues that
are identical in an alignment between two sequences. Percent identity
is only meaningful in the context of an alignment for which similarity
was maximized. Thereafter, the percent identity is the simplest tangible
measure of the degree of relatedness. 20% identity for proteins, and 70%
identity for nucleic acids are thresholds below which simple two way comparisons
become unreliable.
-
Ingroup - the sequences on an evolutionary tree not in the outgroup.
Generally, the ingroup contains the sequences of interest, and the outgroup
contains more divergent sequences added to the tree to clarify the position
of the root within the ingroup.
-
Invariant - a position in a sequence that never changes, implying
a high degree of conservative selection. There are very few truly
invariant residues in protein sequences. In practice, invariant positions
are those that did not change within a particular collection of homologues.
-
Matching matrix, substitution matrix - a table of penalties for all
possible mismatches in an alignment. A mismatch is a difference in an
alignment specified at the level of which character was aligned with which
other character. Different mismatches may carry different alignment
penalties depending on evidence that the characters are more or less interchangeable
in nature. For example, an Asp - Glu mismatch would be assigned a
small penalty, whereas an Asp - Phe mismatch would be assigned a large penalty.
In practice, the matrixes are computed by counting the
numbers of each possible mismatch in a large set of sequences thought to
be well aligned (training set), rather than deducing them from chemical
principles. Mismatch matrixes for nucleic acid alignment generally
reflect a greater propensity for transitions (C - T, A-G) than transversions
(all other changes). The matrix giving the same penalty for all mismatches
is called the identity matrix. Depending on the degree of divergence
permitted among the sequences in the training set, different penalty matrixes
are derived. The matrix corresponding to the degree of divergence
in the experimental sequences is expected to perform more meaningfully
than the others. Common matrixes used for alignment of protein sequences
go by the names BLOSUM and PAM.
-
Meaningful - capturing enough of the true history relating the sequences
that testable hypotheses may be formed and found true. For example,
a meaningful sequence alignment between two divergent proteins could be
tested if the 3D structures of the two proteins were solved and the atoms
were juxtaposed in 3 dimensions as predicted by the alignment. A
meaningful evolutionary tree computed from sequence data would be tested
by comparison to fossil data. Generally, when algorithms are invented
they are subjected to testing on data sets where some such outside data
is already available. Alternatively, they may be tested on simulated
data. To the extent that they perform meaningfully on the test data,
they are proposed to perform meaningfully on other data.
-
Neutral - a variant allele having a selective coefficient of 0.
More generally, a variant allele having a selective coefficient so small
that random variation in allele frequencies is expected to overpower any
directional effect of the fitness difference.
-
Neutral Drift - the process by which neutral alleles periodically
arise and displace each other from the population due to random fluctuations
in allele frequency.
-
Neutral divergence - In evolution, the history of sequence changes
in a gene not attributable to the action of selective pressure. The
rate of neutral divergence affecting all genes in an organism is expected
to be the same.
-
Neutral rate - the rate of neutral divergence. The neutral
rate for an organism is determined by observing the divergence of its pseudogenes,
or of other presumed nonfunctional sequences.
-
Orthologous, orthologues - homologous genes that began their divergence
by the speciation of the organisms carrying them.
-
Outgroup - sequences added to a tree that are believed through some
outside knowledge to be more distantly related than the sequences of interest.
The position at which the outgroup joins represents the position of the
root within the ingroup.
-
Paralogous, paralogues - homologous genes that began their divergence
by gene duplication. The divergence time of paralogous genes between
two organisms will be longer than the divergence times of the organisms
themselves. The prefix para- comes from the Greek for "beside" or "in parallel",
because paralogous copies descend side by side within a lineage. The
inadvertent use of paralogous genes to estimate an organism's divergence
time will give a false answer. Horizontally transferred genes will
also give a false answer, but these comparisons are less commonly referred
to as paralogous. When paralogues remain in the same genome, the implication
is that they have acquired specialized variations of the same function.
Otherwise, one or the other would have been lost due to lack of conservative
selection.
-
Pseudogene - a gene that lost function at some past time.
These genes, being affected by only neutral divergence, are a basis for
measuring the neutral rate for an organism.
-
Root - the position of the oldest common ancestor on an evolutionary
tree. For sequence data, the root may be estimated as the point that
is equi-distant in divergence from all sequences on the tree. This
method of locating the root assumes that divergence occurred at an equal
rate throughout the tree. To estimate the root, an outgroup may be
added to the tree.
-
Saturation, saturation curve - the point at which additional changes
between two sequences will not cause any more differences, because all
sites that can change have already changed. A saturation curve plots
the differences observed (or expected) between two diverging sequences
as a function of time. At saturation, the slope of the curve is zero,
meaning that the number of differences observed is no longer sensitive
to the divergence time. The residues in a protein vary in their rate
of change, and hence saturate at different times. This delays the
approach to saturation and gives each protein a different saturation curve
shape depending on its proportions of relatively non-conserved and relatively
conserved residues. When the saturation curve is well corrected by
some statistical model that estimates the number of changes from the number
of differences, the curve will become linear with time. However,
the envelope of uncertainty will increase with time and go to infinity
at the saturation time. In practice, the proteins will have become
unalignable, and unrecognizable as homologues by this point, except perhaps
with the aid of structural homology.
-
Selective coefficient - the difference between fitness of an allele
versus a reference allele for that gene. The reference allele is
generally the wild type allele.
-
Similarity - a measure of the relatedness of two sequences in a
particular alignment. The purpose of an alignment algorithm is to find
an alignment with the maximum similarity score. Similarity is computed
from the number of identities, the kinds of differences, and the number
and length of gap positions. The parameters used to calculate the similarity
consist of a matching matrix giving a partial matching score for
various mismatches and gap penalties that are idiosyncratic to each alignment
algorithm, and may be adjustable by the user. For an alignment
algorithm to produce a meaningful alignment, the mismatch and gap penalties
should reflect the relative distribution of kinds of mismatches and kinds
of gaps observed in nature.
-
Simulated data - sequence data created by inventing a common ancestral
sequence and evolving it according to a preconceived tree using a random
number generator to simulate random mutagenesis. A model for conservative
selection may be imposed. Simulated data is generally used in the
testing of tree construction algorithms. Although simulated data
may not capture all of the subtleties of the real evolutionary process,
it can be generated abundantly and modified to model various aspects of
molecular evolution.
-
Wild type - the most common allele found in the population.
Lacking population data, the wild type may be considered the allele in
some reference strain. More conceptually, wild type may refer to
a collection of alleles having essentially the same fitness and representative
of most organisms in a population. That is, the wild type is that with
which a new allele introduced into the population must compete.
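Several glossary entries (Change, Difference, Divergence, Saturation) refer to correcting the observed number of differences to an estimated number of changes. As a rough illustration, the sketch below implements two standard textbook corrections: the Poisson correction for protein sequences and the Jukes-Cantor correction for nucleic acid sequences. These are only two of many models, and they are not necessarily the ones used by any program mentioned in this document. The function names are chosen here for illustration.

```python
import math


def poisson_corrected_distance(p: float) -> float:
    """Estimated changes per site from the observed fraction of differing
    sites p, using the Poisson correction d = -ln(1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site for nucleic acids, using the
    Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The correction grows rapidly as the observed differences approach saturation.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    d = poisson_corrected_distance(p)
    print(f"{p:.1f} differences/site -> {d:.2f} estimated changes/site")
```

At low divergence the corrected and observed values are nearly equal. Near saturation a small uncertainty in the observed differences becomes a large uncertainty in the estimated changes, which is the widening envelope of uncertainty described under Saturation.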