Population Genetics of LINE-1

A lecture given at the 18th International Congress of Genetics, Beijing, Aug. 1998.

Summary

This presentation is mainly devoted to showing that the high copy number and high allele frequency of LINE-1 are not obstacles to applying the transposition-selection equilibrium theory of Charlesworth. Related topics on the web site are:

LINE-1 (L1)

Comparison of LINE-1 with a typical Drosophila transposon

Drosophila transposon    | LINE-1
10 copies                | 100,000 copies
10^-4 inserts/generation | 0.02 - 0.03 inserts/generation
Most are active          | < 1/1000 are active
Low allele frequency     | High allele frequency (including active loci)
Excised 10^-5/gen.       | Excised < 10^-6/gen.

Whereas a great deal of insight about transposons in Drosophila has been worked out starting with equations of the sort:

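n @ ln w_n / @ n = -n (u - v)
[read @ as "partial derivative of"; w_n is the fitness of an individual carrying n copies, u is the transposition rate per copy per generation, and v is the excision rate per copy per generation]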

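these analyses have not been applied to LINE-1. The reasons usually given are its very high copy number (about 100,000) and the high allele frequencies of its loci, including active ones.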

The point of this presentation is that these perceptions are incorrect. With minor adjustments, the theoretical work developed for transposons in Drosophila and elsewhere can be applied to LINE-1.

What does this mathematical treatment mean and what does it get you?

- n @ ln w_n / @ n = n (u - v)

ref: Charlesworth and Langley (1989)

The right side of the equation [n(u-v)] is simply the net influx of copies into the genome as a result of transposition. Its magnitude is usually determined by observation.

The left side embodies an equilibrium hypothesis. It means that flies with fewer copies must be favored in reproduction over flies with more copies. Transposon alleles would then be under negative selection, which forces their frequencies to drift down towards 0. Equilibrium is reached when the net loss of alleles drifting to 0 frequency matches the net influx by transposition. The fitness function is generally not known; it is proposed to exist as part of the equilibrium hypothesis. What is known is that it must impose an increasing penalty for each additional copy if an equilibrium is to be reached. The emphasis on drifting to 0 frequency is the source of the perception that LINE-1 can't be explained by this treatment.
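To make the balance concrete, here is a minimal sketch that assumes a specific fitness function, w_n = exp(-t n^2 / 2). This form is not from the lecture; it is chosen only because its penalty grows with each additional copy. For it, -n @ ln w_n / @ n = t n^2, so the balance with n (u - v) falls at n = (u - v) / t. The selection parameter t is hypothetical, and u and v are taken as per-copy rates of the order listed for Drosophila in the comparison table.

```python
import math

def log_fitness(n, t):
    """ln w_n for the ASSUMED fitness function w_n = exp(-t * n**2 / 2)."""
    return -0.5 * t * n * n

def selection_term(n, t, h=1e-4):
    """Left side of the balance, -n * d(ln w_n)/dn, by central difference."""
    slope = (log_fitness(n + h, t) - log_fitness(n - h, t)) / (2 * h)
    return -n * slope

def equilibrium_copy_number(u, v, t):
    """Copy number where -n d(ln w_n)/dn equals n (u - v) for the assumed w_n."""
    return (u - v) / t

u, v = 1e-4, 1e-5  # illustrative per-copy transposition and excision rates
t = 9e-6           # hypothetical selection parameter, chosen so that n comes out near 10
n_eq = equilibrium_copy_number(u, v, t)

# At equilibrium the two sides of the equation agree.
print(n_eq, selection_term(n_eq, t), n_eq * (u - v))
```

Below n_eq the influx term is larger and copy number rises; above it the selection term is larger and copy number falls. A different assumed fitness function would move the balance point but not the logic.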

Starting from equations of this sort, you can address a variety of problems related to transposons.

So there is much we could apply to LINE-1, if we can get past n=100,000 and high allele frequencies. It is important to understand that even if the equilibrium model implied by the equation above is not accurate, it is a stepping stone to more realistic models.

100,000 is the wrong copy number for use in this treatment

Since most of the 100,000 copies are dead, the dead (d) and active (a) copies should be treated separately.

For dead copies:

instead of:    - n @ ln w_n / @ n = n (u - v)

use:     - n_d @ ln w_(n_d) / @ n_d = n_a u - n_d v

where n_d and n_a are the numbers of dead and active copies. Then, if you assume that the active copies come to some equilibrium, n_a u becomes a constant.

With even a tiny rate of excision, n_d will rise until the term on the right goes to 0. The system is self-limiting without any special need for selection against the 100,000 dead copies.
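The self-limiting behavior can be sketched numerically. The example assumes that the 0.02 - 0.03 inserts/generation from the comparison table is the constant influx from active copies, and that dead copies are lost only by excision at a per-copy rate v. The function names are illustrative.

```python
import math

def dead_copy_equilibrium(influx, v):
    """Dead-copy number at which the right side, influx - n_d * v, reaches 0."""
    return influx / v

def dead_copies_after(generations, influx, v, start=0.0):
    """Closed form of n_d after repeated steps n_d -> n_d + influx - v * n_d."""
    n_star = dead_copy_equilibrium(influx, v)
    return n_star + (start - n_star) * (1.0 - v) ** generations

for influx in (0.02, 0.03):
    # Excision rate that would hold the dead copies at 100,000 with no selection.
    v_needed = influx / 100_000
    # Plateau reached if excision ran at the table's upper bound of 10^-6/gen.
    plateau = dead_copy_equilibrium(influx, 1e-6)
    print(influx, v_needed, plateau)
```

Under these assumptions an excision rate of 2 to 3 x 10^-7 per copy per generation would balance the influx at 100,000 copies, which is within the "< 10^-6/gen." bound in the table.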

If there is no excision, then, following Charlesworth (1985), you can divide through by n_d, integrate up to n copies, and take the exponential to get:

w = n^(-n_a u)

Using empirical values for human L1, the total fitness if the number of dead copies were held at 100,000 by negative selection would be about 0.8. In other words, unless you believe that 20% of us are eliminated from the reproductive population due to problems caused by LINE-1, you have to conclude that the number of dead copies is either at equilibrium with excision or just moving upwards freely.
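The figure of about 0.8 can be checked with a few lines of arithmetic. This sketch assumes that the total fitness is n raised to the power -(n_a u), with n the number of dead copies, and that the 0.02 - 0.03 inserts/generation from the comparison table stands in for n_a u.

```python
import math

def fitness_without_excision(n_dead, active_influx):
    """Total fitness n**(-n_a*u) when selection alone holds n dead copies."""
    return n_dead ** (-active_influx)

low = fitness_without_excision(100_000, 0.02)   # about 0.79
high = fitness_without_excision(100_000, 0.03)  # about 0.71
print(round(low, 2), round(high, 2))
```

The lower insertion rate reproduces the value of about 0.8 quoted above; the higher rate implies an even larger fitness cost.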

In practice, the thing to do is to ignore the dead copies and focus on the active ones. The argument above simply says that the genome probably also ignores them. One caveat: if you envision the fitness function as a mechanism that counts up copies and imposes a fitness penalty, ignoring dead copies implies that the mechanism can somehow count active copies and ignore dead ones.

Considering active copies only, is it reasonable to have active copies at high allele frequency if they are constrained by negative selection against the total numbers of active copies?

Ref: Charlesworth and Charlesworth (1983)

The equation for calculating how high the allele frequency can get under negative selection comes from that same paper and is shown in a modified form below:

The portion of active L1 copies expected to be composed of alleles of frequency between j and k =

Note that because alpha is very low, the expression simplifies so that the curve shape depends only on beta (representing loss of activity) and not on alpha (which is where the insertion rate figures in). For a first approximation, assume that alpha (and hence u) is of the right magnitude to keep n constant.

High allele frequencies become possible as Ne approaches 1/(s + v + i).

Here I've shown a couple of curves made with arbitrary parameters illustrating a mainly low allele frequency distribution and a mainly high allele frequency distribution. The break-even point (1/2 of n from alleles of above 50% frequency) corresponds to the familiar 4 Ne s = 1 in standard drift theory. The red curve uses parameters as best we understand them for humans. Ne is taken as 13,000 representing the number of breeding individuals about 100,000 years ago (Cavalli-Sforza's "History and Geography of Human Genes"). The only force acting to oppose fixation used to compute this curve was inactivation by base substitution.
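The break-even statement can be illustrated with a small sketch. It assumes, as a stand-in for the lecture's equation, a beta-type stationary distribution in the limit of very small alpha, so that the share of active copies carried by alleles of frequency x is proportional to (1 - x)^(beta - 1), with beta = 4 Ne (s + v + i). Under that assumption the portion between frequencies j and k is (1 - j)^beta - (1 - k)^beta. The functional form and the value of i used below are hypothetical, not taken from the lecture.

```python
import math

def beta_from_parameters(ne, s, v, i):
    """ASSUMED scaling: beta = 4 * Ne * (s + v + i)."""
    return 4.0 * ne * (s + v + i)

def portion_of_copies(j, k, beta):
    """Portion of active copies from alleles with frequency between j and k,
    assuming a density proportional to (1 - x)**(beta - 1) (alpha -> 0 limit)."""
    return (1.0 - j) ** beta - (1.0 - k) ** beta

# Break-even: with beta = 1, half of n comes from alleles above 50% frequency.
print(portion_of_copies(0.5, 1.0, 1.0))

# Larger beta pushes the distribution towards low allele frequencies.
for beta in (0.5, 1.0, 4.0, 20.0):
    print(beta, portion_of_copies(0.1, 1.0, beta), portion_of_copies(0.5, 1.0, beta))

# Hypothetical inactivation rate i, used only to show how beta is formed.
beta_example = beta_from_parameters(ne=13_000, s=0.0, v=0.0, i=2e-5)
print(beta_example)
```

In this sketch beta = 1 is the break-even point, the analogue of 4 Ne s = 1, and smaller values of Ne or of (s + v + i) move more of the copy number into high-frequency alleles.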

This produces a significant contribution to n from alleles above 10% in frequency. This isn't meant to be an actual prediction of the human active allele frequency, but just to show that accumulation of active alleles at high allele frequency is perfectly plausible under circumstances relevant to the human situation.

The equation describes a stationary distribution. With a changing population size, the distribution will tend to move towards that computed for the current population size. The movement should be slow at large population sizes, but faster at low population sizes. That is why the current distribution would be expected to retain high frequency alleles from events earlier in human history. I estimate that it would take about 0.5 Myr to completely lose the high frequency end of the distribution. However, the active L1 alleles being produced more recently will be distributed with a low allele frequency. So the true distribution today should be a composite of that from long ago with that representing more recent events. One could argue that even smaller effective population sizes from earlier in human history would shift the curve further to the right.

The equation is not set up to force n to stay constant. Since the alpha term (incorporating influx) does not influence the curve shape, one could set it so that n was increasing or decreasing and get the same computed curve shape but integrating to > 1 or < 1. Of course, there is no such thing as a stationary distribution if n is not stable. So if alpha (and accordingly u) is too high to balance beta, the true situation would be that there would always be an additional component of low frequency alleles representing the excess influx of young inserts forcing n to increase.

For the red curve that I plotted, I'd have to reduce u by about 30% from what I think it really is to bring n to a constant. Alternatively (and more likely, as described below), there are more components of (s + i + v) than are represented in the calculation. The latter would shift the curve somewhat to the left.

Active loci are heterogeneous

In other words, some of the allele frequency attributed to "active" loci comes from alleles that have already undergone the "inactivation" process represented by the i term in the distribution equation. So if we try to match natural selection's stringency for activity, the active allele frequencies are going to be lower, on the order of 20-30%. Alternatively, if we accept all alleles that show any activity as "active", then the stringency of i will have to be reduced to accept more amino acid replacements, and the curve will shift accordingly to the right.

What does it all mean?

In particular, as we try to get the selective forces right to explain what keeps the copy number under control, equations like

n @ ln w_n / @ n = -n (u - v)

should work fine. The allele distribution doesn't matter for those issues. Unless

Human and mouse L1 observations are at different scales

The observation of high-frequency active alleles in humans is then related to the bottleneck at the beginning of the modern species being more recent than the lifetime of an active L1 locus. If we turn our attention to another species, the mouse, we see a very different picture. The expected lifetime of an active locus in the mouse is about 10 times shorter than in humans because of the higher base substitution rate (represented in the term i). The species were also formed earlier. The present-day active L1s in the mouse will have an allele distribution reflecting whatever the lowest population sizes were in the last 40,000 years. Investigation of earlier periods of mouse L1 activity through evolutionary tree construction should be expected to encounter some periods dominated by bottlenecks and others dominated by consistently high population numbers.

Is LINE-1 at equilibrium?

Abundance of mouse L1 inserts of different ages.

On average, mouse LINE-1 has overreplicated by no more than 3% per L1 generation over the last few million years. This curve was reported in Hardies et al. (1986). The empirical curve matches what you get if the number of active L1s per mouse genome increases by 3% over each locus lifetime of about 40,000 yrs. It is also not clear whether some of the decline in copy number for earlier time periods arises because older copies were physically removed. On average, mouse L1 can't be very far away from equilibrium.
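To give a feel for what 3% per locus lifetime implies, the following sketch compounds that rate over a chosen time span. It assumes steady exponential growth with one L1 generation equal to one locus lifetime of 40,000 yrs; the function name and the time spans are illustrative only.

```python
import math

def active_copy_growth(years, growth_per_lifetime=0.03, lifetime_years=40_000):
    """Fold increase in active copy number after `years` of steady compounding."""
    lifetimes = years / lifetime_years
    return (1.0 + growth_per_lifetime) ** lifetimes

for span in (40_000, 1_000_000, 3_000_000):
    print(span, active_copy_growth(span))
```

At this upper-bound rate the active copy number roughly doubles per million years (25 locus lifetimes), which is slow compared with the turnover of individual loci.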

There is no mechanism (yet) that would bring about an equilibrium

Mouse L1 lineages show surges in output

The A2 lineage from Mus spretus appears to show a burst of output amounting to about 1000 copies. This burst is limited to a period of 100,000-200,000 yrs. within a lineage of 1 Myr in duration.

Could L1 be a nonequilibrium system that fluctuates around an equilibrium point?

Alternatively, are the surges artefacts caused by population size fluctuation?

Mouse L1 is organized into longstanding separate lineages, some of which have distinctive control sequences.

Coalescence within an individual mouse L1 lineage is relatively rapid; however, there are longstanding separate lineages that seem resistant to coalescence. This leads to the idea that there are several different families of L1 that behave independently. Reinforcing that view are the facts that there are several different promoters for mouse L1 and that one of the currently active mouse L1 lineages acquired its promoter by recombination with an earlier lineage. See Naas et al., 1996.

LINE-1 lineages have transferred between Mus spretus and Mus domesticus.

Of the 3 best characterized LINE-1 lineages of the mouse, one (the L1MdA2 lineage) splits between a Mus spretus version and a Mus domesticus version at about the expected time for the separation of these two species (~1 Mya). Another (the L1Md4 lineage, more recently called the TF lineage) splits between spretus and domesticus at about 1/2 that time, suggesting a transfer of an active locus after speciation. A third lineage, which we are currently characterizing in detail, the Z lineage, also shows a split late after speciation.

Interspecies transfer of transposons has been of intense interest in Drosophila. In the case of the mouse, we require no special mechanism because the two species can still interbreed with fertile offspring.

Reinforcing the idea that there is a low amount of genetic interchange between the mice, we have found some Mus spretus-specific truncated LINE-1 sequences in the Mus domesticus inbred strain C57BL/6J. At least one of these seems to have transferred in the context of Mus spretus flanking sequence.

The total amount of DNA exchanged in recent times seems to be about 0.5% of the genome. So this is an opportunity to study the establishment of a new LINE-1 family in a species after introduction from outside, with the unique component that we have some information on the rate of introduction.

SUMMARY