Different bootstraps from different algorithms
Consider the following test sample: 4 sequences (A,B,C,D) identical
at all positions except that (A,B) differ from (C,D) at one position. PAUP
calculates a bootstrap for (A,B)(C,D) of 64 using parsimony, and 76 using
NJ. Mega calculates a bootstrap of 100 using NJ. The theoretically "correct"
value is e-1=63.2. This is from the binomial for the
probability of not sampling the 1 informative position in each of 100
draws: (1-1/n)n = e-1 using the approximation:
(1-p)n = e-np.
PAUP with parsimony approaches the theoretical value in this and all other
cases tested; PAUP with NJ slightly exaggerates the support because it "breaks
ties" (randomly assignes support to some topology when it finds no support
for any topology). PAUP parsimony is therefore always more conservative than
PAUP NJ in the low divergence zone. MEGA NJ first finds a best tree and then
breaks ties always in favor of the best tree. It is therefore capable of
being disasterously optimistic about bootstrap support in the low divergence
zone.
We use PAUP parsimony bootstraps. NJ also has tremendous problems with the
missing data from truncation and recombination patterns, so we can only use
it on extensively pruned data anyway.
Also note that if one inadvertantly filters informative sites prior to
bootstrap sampling, the bootstrap value becomes exaggerated. For example,
if one filters the test set above, there is only one position left and the
bootstrap will artefactially rise to 100. The effect sets in when there are
less than 10 total positions, so filtered datasets can be used so long as
10 or more positions remain. Alternatively, they can be supplemented with
10 dummy constant positions.
Low power
Consider the following test data, each of 4 sequences with c positions
supporting (A,B)(C,D) and d positions discordant with that grouping and all
others invariant: (nrep=10000)
c d bt.
1 0 64
2 0 87
3 0 95
4 0 98
2 1 69
3 1 83
4 1 91
5 1 95
5 2 87
5 1,1 92
6 1,1 96
The fact of whether you got 2 or 3 positions to support a topology
generates a low power problem that we just have to live with in the low
divergence zone. Note however, that the fact of whether you happen to
get 0 or 1 discordant positions is a super low power problem that has
an even stronger effect on the boot value. If you knew the probability (b)
of getting a position that grouped (A,B)(C,D) by chance, then the
confidence that the clustering was not caused by chance would be of the form
1-bc.
The number of discordant positions would be irrelevant. Note that the
way the resampling probability works is as if to assume that the probability
for a misinformative position is .36 if there are no discordant positions.
If there is one informative position the boot is computed as if the
probability of any informative position being misinformative is .55, etc.
Thus the bootstrap is substituting a sampling probability for the probability
of a misinformative position, and it is adjusting that from the number
of discordant positions. There are two problems with that:
- The result (the boot value) is not mathematically the same
as a confidence, so it's hard to know where to draw
the threshold if you want, say, 95% confidence.
- When the number of discordant positions is small (like 0 or 1)
then there is a big uncertainty thrown into the boot
value by the original evolutionary sampling.
Parametric Statistical Analysis
To deal with this, we try to get an estimate of the probability of a
misinformative position and do the math to supplement the boot value for
the final evolutionary model.
Note that this is a bare-bones method designed for a quick pencil and paper
estimate that relates the numbers of informative positions and apomorphies
found by inspection or by PAUP to the ultimate likelihood that one is
not chasing noise. It is not intended to preclude doing more elaborate
statistical analysis of a conclusion drawn from the data. The problem
is that due to the bulk of data,
computational limitations and a high number of possible
segmentation patterns, one has to make decisions about which sequences
to include prior to acquiring an alignment suitable for computationally
intensive analysis. There just has to be a computationally nonintensive
way to make these decisions.
The math is based on a parameter, b, which is an outside estimate of
the probability that an informative position is misinformative. For
a real clustering, there will be several positions supporting the
clustering, and there will be a series of parameters b1,
b2, b3, etc. estimating the probability that
the first is misinformative, the 2nd is misinformative, etc.
The confidence in the clustering is then
1-b1 * b2 * etc.
The b values might be obtained a number of ways:
- From the noninformative divergence in the sequence:
- b1 = d2 * N * M * (M-1)/T where
- d is the average divergence of a group considered
a radiative cluster.
- N is the len of sequence
- M is the number of sequences
- T is the total number of informative sites.
- if b greater than or ~=1, then b=1
- b2 = d2 * N * X /T where
- X is the expectation for once supported misinformative
sites (b1 before division by T)
- if X is less than 1, then X = 1.
- Higher terms are calculated as for b2
- As above, but with the expected divergence from a wider data set, or
from an evoutionary model.
- From the fraction homoplasy in this or comparable datasets. In particular
this empirical method lets us deal with the model that there is
an especially high frequency of single position gene conversion among
nearly identical sequences, which then falls off as they diverge.
This scenario, for which there is some evidence, introduces a
high level of noise which drops off with divergence, contrary to
the models behind virtually every evolutionary algorithm. We can
see what this does to our statistics if we extact the b value from
a comparable dataset for which the tree has been worked out, divide
out the specifics of N and M and replace them with the relevant
number from this dataset.
- As an arbitrary parameter of .3 to .5. That's no worse than what the
bootstrap does, and it gives a quick estimate for cases that the
bootstrap algorithm doesn't handle well.
Uses of the b parameter
- Estimating confidence in one of n positions.
-
If we make an oligo probe, it doesn't matter if the
other positions supporting the clustering are
correct; only the one in the probe affects how it
will behave. Confidence in that one is 1-bn.
- Empirically we know that some fraction of single nucleotides
that are correctly informative, are also present in
parallel on different lineages, thus messing up the probe.
The empirical b value in this case is divided by the
internal length of the tree instead of M(M-1) (since we only
care about parallelisms on internal edges) and then
multiplied by an estimate of the length of the relevant
part of the theoretical total LINE-1 tree.
This in turn says how many probes can be expected to
malfunction and how many will be needed tolerate
that fraction failures.
- We do a lot of experiments where a arrayed LINE-1 clones
are probed with a battery of oligo probes. The correlation
between the array counts and the evolutionary tree
is governed by essentially this same math.
- To estimate confidence that the lineage exists, versus whether
all members are in it.
Say sequences A,B,C group with some bootstrap value. The
bootstrap estimates the probability that one of A,B,C
doesn't belong in the cluster. We often don't care
if all three belong there; only that 2 of them do, which
is enough to know that the lineage exists. That confidence
where n positions support A,B,C, is found by squaring each
b term (or more generally taking each b to the power of
number of taxons in the group that agree at that position -1).
- To diagnose recombinants.
Most recombinant LINE-1s are found by inspection, looking for
a run of informative positions that switches affinity
of one sequence
to a different subset of sequences in an alignment.
This math provides a criterion for how many positions
you want to see switch affinity before marking a candidate
recombinant for further study.