Molecular evolutionary theory, as related to protein family analysis and choice of sites for in vitro mutagenesis.

Conserved residues:

            Most students understand intuitively that functionally important residues may not be able to change much during evolution, and hence residues that don't change may be good targets for in vitro mutagenesis.  In a finer sense, given a variety of residues implicated by structural data to interact with the substrate, the ones that do not change in the family would seem to be essential in their function, while the ones that do change might be either irrelevant or may reflect functional distinctions among the family members.
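
            As a minimal sketch of the finer comparison described above, the following Python fragment labels substrate-contact positions as invariant or variable across a family.  The alignment, the member names, and the contact column numbers are hypothetical toy values chosen for illustration; they are not taken from any real protein family.

```python
# Hypothetical toy alignment: one row per family member, all rows equal length.
alignment = {
    "member_1": "MKHDSLAG",
    "member_2": "MRHDSIAG",
    "member_3": "MKHESLSG",
    "member_4": "MQHDSVAG",
}

# Hypothetical 0-based alignment columns implicated in substrate contact.
contact_columns = [2, 3, 5]


def residues_at(alignment, column):
    """Return the set of residues seen at one alignment column."""
    return {seq[column] for seq in alignment.values()}


def classify_contacts(alignment, columns):
    """Label each listed column 'invariant' or 'variable' across the family."""
    labels = {}
    for col in columns:
        seen = residues_at(alignment, col)
        labels[col] = "invariant" if len(seen) == 1 else "variable"
    return labels


for col, label in classify_contacts(alignment, contact_columns).items():
    print(col, sorted(residues_at(alignment, col)), label)
```

            Invariant contact positions are the candidates for essential function; variable ones may be irrelevant or may mark functional distinctions among family members.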

            As a control for this concept, pseudogenes have been investigated.  Pseudogenes have lost function, so their residues show the evolutionary behavior expected of residues under no conservation at all.  Pseudogenes evolve at the neutral rate, which is fast enough to saturate (randomize) the identity of residues well within the divergence time separating humans from other mammalian orders (such as rodents).  Comparing the divergence rate of real proteins with that of pseudogenes reveals the fraction of possible mutations that have been excluded by natural selection.  This measure of conservation varies substantially among different proteins.  Examples are:
 
Protein            % variations excluded by natural selection
RNAse              50-70
Cytochrome c       90
Pyruvate kinase    97
Histone            99.7
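
            The arithmetic behind such a percentage can be sketched as follows, assuming the excluded fraction is taken as one minus the ratio of the protein's divergence rate to the neutral (pseudogene) rate.  That formula is one reading of the comparison described above, and the rates in the example are hypothetical values for illustration only, not measurements.

```python
def fraction_excluded(protein_rate, neutral_rate):
    """Fraction of possible changes removed by selection.

    Both rates must be in the same units (e.g. substitutions per site
    per unit time). Assumes excluded = 1 - protein_rate / neutral_rate.
    """
    if neutral_rate <= 0:
        raise ValueError("neutral_rate must be positive")
    return 1.0 - protein_rate / neutral_rate


# Hypothetical rates, in arbitrary but matching units.
neutral_rate = 5.0
for name, rate in [("slowly evolving protein", 0.05),
                   ("rapidly evolving protein", 2.0)]:
    percent = 100 * fraction_excluded(rate, neutral_rate)
    print(f"{name}: {percent:.1f}% of variations excluded")
```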

            Generally speaking, the fraction of changes found to be functionally insignificant after in vitro mutagenesis is much higher than the fraction judged to be acceptable by natural selection.  Natural selection is believed to eliminate alleles with a selective coefficient as small in magnitude as -0.001.  Biochemical assays are not nearly that sensitive.  This discrepancy between the sensitivities of natural selection and biochemical assay often causes highly conserved residues to show puzzlingly little effect upon in vitro mutagenesis.

            It is also apparent that many of the residues that are observed to change are nonetheless under considerable natural selection.  Of course, it is well recognized that chemically conservative residue changes are better tolerated than non-conservative changes.  The nature of the genetic code is such that most single-nucleotide changes are chemically conservative.  The numbers tabulated above are for single-nucleotide substitutions, and hence are weighted towards the chemically conservative class of amino acid replacements.  If it were simply true that residues exchanged freely within an acceptable set of replacements, but were prohibited by natural selection from changing otherwise, this kind of evolution too would saturate well within the divergence time separating humans and mice (~85 million years).  This is clearly not the case.  Observed divergence continues to increase over a much broader range of divergence times, and can be used to construct evolutionary trees spanning up to 2 to 3 billion years.  This implies that the residues showing partial conservation are prohibited from changing at all in most time periods and then allowed to change in others.  Among the factors that might change from time to time are the identities of the surrounding residues.  This implies a high degree of non-additivity in the impact of residue replacements on the fitness judged by natural selection.  Most of this implied complexity is apparently at the more subtle level of functional revision, since the larger functional differences measured after in vitro mutagenesis tend to be additive.  The functional effects restraining the lesser conserved residues are usually at the subtle level sensed by natural selection but not by biochemical assay, and the intuitive sense that lesser conserved residues will not give much of a functional difference during in vitro mutagenesis usually holds true.  But there can be exceptions, where mutating a residue to a variant found in other homologues produces a serious loss of function.
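
            The saturation argument above can be illustrated with a small calculation.  The sketch below assumes the simplest possible model, in which a site moves freely and with equal probability among k acceptable residues (an equal-rates k-state model, the protein analogue of the Jukes-Cantor model for nucleotides).  Real replacement patterns are not this uniform, so the numbers are illustrative only.

```python
import math


def expected_difference(d, k):
    """Expected fraction of differing sites between two sequences.

    d: substitutions per site separating the two sequences.
    k: number of equally acceptable residues per site (assumed model).
    """
    return (k - 1) / k * (1.0 - math.exp(-k * d / (k - 1)))


# Observed difference rises quickly and then flattens at (k - 1) / k.
for k in (2, 4, 20):
    curve = [round(expected_difference(d, k), 3) for d in (0.5, 1, 2, 4, 8)]
    print(f"k={k}: {curve}  plateau={(k - 1) / k:.3f}")
```

            Under this model the observed difference stops growing once a few substitutions per site have accumulated, so free exchange within a small acceptable set cannot account for divergence that keeps increasing over billions of years.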

Evolutionary Tree:

            An advantage of estimating the evolutionary tree for a protein family under examination is that it can help clarify the difference between orthologues and paralogues.  The existence of paralogues within the same genome implies functional specialization, and sets up in vitro mutagenesis experiments that swap residues between the paralogous proteins in order to map how the functional distinctions are brought about.  The tree also helps create objective subdivisions of the protein family, called clades.  It may substantially clarify an in vitro mutagenesis plan to distinguish two residues that interchange freely at a position from two residues that are each conserved within their respective clades.  The latter is a promising swap that might reveal a functional distinction between the two residues, whereas the former is not.
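
            The distinction between freely interchanging and clade-conserved residues can be sketched in Python as follows.  The clade names and sequences are hypothetical toy data, and the three labels are illustrative names introduced here, not standard terminology.

```python
# Hypothetical toy data: aligned sequences grouped by clade.
clades = {
    "clade_A": ["MKHDSLAG", "MKHDSIAG", "MKHDSLAG"],
    "clade_B": ["MRHESLAG", "MRHESVAG", "MRHESIAG"],
}


def classify_column(clades, column):
    """Classify one alignment column by its pattern across clades."""
    per_clade = {name: {seq[column] for seq in seqs}
                 for name, seqs in clades.items()}
    all_residues = set().union(*per_clade.values())
    if len(all_residues) == 1:
        return "invariant"          # same residue everywhere
    if all(len(res) == 1 for res in per_clade.values()):
        return "clade-specific"     # each clade conserves its own residue
    return "interchanging"          # residues vary within at least one clade


length = len(clades["clade_A"][0])
for col in range(length):
    print(col, classify_column(clades, col))
```

            Columns labelled "clade-specific" correspond to the promising swaps; columns labelled "interchanging" correspond to residues that exchange freely.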

            There are many methods for calculating trees from sequence alignments.  For protein alignments, the best starting point is a neighbor-joining tree.  This is a relatively fast and reliable method that can be accessed at the bioinformatics facility, and at various places on the web.  It permits a variety of corrections for saturation.  The confidence in the divisions of the tree should be evaluated with a bootstrap analysis.  The bootstrap analysis consists of building many replicate alignments by randomly resampling the alignment positions (columns) with replacement, recalculating the tree for each replicate, and then reporting for each clade the fraction of replicates in which all of its members were again found grouped together.
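
            The bootstrap procedure can be sketched in a few lines of Python.  To keep the example short and self-contained, the tree is built with simple average-linkage clustering (UPGMA) on uncorrected p-distances rather than with neighbor joining, and no saturation correction or gap handling is applied; these are simplifying assumptions, and the sequences are hypothetical toy data.  In practice an established phylogenetics package should be used.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma_clades(seqs):
    """Return the non-trivial clades (frozensets of names) of a UPGMA tree."""
    leaf_d = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
              for pair in combinations(seqs, 2)}

    def avg(c1, c2):
        # Average leaf-to-leaf distance between two clusters.
        total = sum(leaf_d[frozenset((i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset([name]) for name in seqs]
    clades = set()
    while len(clusters) > 2:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        clades.add(c1 | c2)
    return clades


def bootstrap_support(seqs, replicates=200, seed=0):
    """Fraction of column-resampled replicates that recover each clade."""
    rng = random.Random(seed)
    length = len(next(iter(seqs.values())))
    reference = upgma_clades(seqs)
    counts = {clade: 0 for clade in reference}
    for _ in range(replicates):
        # Resample alignment columns with replacement.
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {n: "".join(s[i] for i in cols) for n, s in seqs.items()}
        found = upgma_clades(resampled)
        for clade in reference:
            counts[clade] += clade in found
    return {clade: counts[clade] / replicates for clade in reference}


# Hypothetical toy alignment with two apparent clades.
seqs = {
    "A1": "MKVLAAGIVGSDEH",
    "A2": "MKVLSAGIVGSDEH",
    "B1": "MRILAAGLVATDQH",
    "B2": "MRILAAGLVASDQH",
}
for clade, support in bootstrap_support(seqs).items():
    print(sorted(clade), support)
```

            A clade recovered in nearly all replicates is supported by signal spread across many alignment positions, while a low fraction means the grouping depends on only a few columns.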

Alignment:

            Any statement about conservation of individual residues, whether intuitive or through a formal phylogenetic analysis, relies on the sequences being meaningfully aligned.  This becomes increasingly difficult across more divergent splits in the tree.  A range of programs used for this purpose is discussed in the alignment help document.  When an alignment spans divergence beyond the point where its validity is obvious to the eye, a more elaborate method of verification is required.  If there are 3D structures in the two divergent clades, then one can ask whether the sequence alignment recapitulates the structure alignment.  If not, there are alignment methods that directly incorporate the structure alignment information.  In the absence of 3D structure information, one can predict secondary structure averaged over the individual clades and ask whether the sequence alignment sensibly aligns the predicted secondary structure elements.  A help document for secondary structure prediction is available.
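
            One simple way to quantify the secondary structure check is sketched below.  It assumes one representative aligned sequence per clade and one predicted secondary structure string per clade, with one letter per residue (H for helix, E for strand, C for coil) and "-" for alignment gaps.  The sequences and predictions are hypothetical, and the agreement score is an illustrative measure rather than an established statistic.

```python
def residue_states(aligned_seq, ss):
    """Spread a per-residue structure string over a gapped aligned sequence."""
    n_residues = sum(ch != "-" for ch in aligned_seq)
    if n_residues != len(ss):
        raise ValueError("structure string must have one letter per residue")
    states, i = [], 0
    for ch in aligned_seq:
        if ch == "-":
            states.append("-")      # gap column: no structure state
        else:
            states.append(ss[i])
            i += 1
    return states


def ss_agreement(aligned_a, ss_a, aligned_b, ss_b):
    """Fraction of ungapped columns where both predictions agree."""
    pairs = [(x, y)
             for x, y in zip(residue_states(aligned_a, ss_a),
                             residue_states(aligned_b, ss_b))
             if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs)


# Hypothetical representatives of two divergent clades.
aligned_a, ss_a = "MKV-LAAGIV", "CHHHHHHCC"
aligned_b, ss_b = "MRILLA-GLV", "CHHHHHCCC"
print(ss_agreement(aligned_a, ss_a, aligned_b, ss_b))
```

            A low agreement across a divergent split suggests that helices and strands are being aligned against each other, and that the alignment in that region should be revisited.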

Neutral Drift versus Selection:

            There is a long-standing argument in molecular evolution about the nature of the changes that do occur.  Neutral theory holds that most of them occur simply because they do no harm.  Selection theory holds that most changes occur due to functional optimization driven by natural selection.  To explain the large number of changes over time, selection theory postulates that the exact requirements on the function of the protein keep shifting.  Whichever theory is correct, it is clear that at the level of sensitivity of biochemical assays most of the naturally occurring changes are effectively neutral.  This should influence your thinking about targets for in vitro mutagenesis.  For example, suppose you observe an alpha helix with a proline in it.  This is unusual, so perhaps you would propose to do in vitro mutagenesis to discover why this alpha helix needs to be distorted by a proline.  The answer may be "because it does no harm".  The case for investigating this residue would be stronger if you observed it to be conserved, at least within a clade on the tree.
 
 

Glossary: