Screen shots of 4 color sequence data in Consed.
Modern DNA sequencing is now moving towards capillary sequencers.
A liquid polymer is used in place of a gel in a thin capillary bed.
New polymer is pumped into the capillary between reads. Because the
capillary is very small, the voltage can be raised and the run time reduced.
Detection is by four color fluorescence.
Loading of the sample is by electrokinetic injection. The tip
of the capillary and an electrode are dipped by a robotic mechanism into
a sample well in a tray (typically 96 wells), and the sample is loaded
by electrophoresis. Then the tip of the capillary is transferred
to a buffer chamber. The instrument can have 8, 16, or even more
capillaries loaded in parallel. The UTHSCSA sequencing facility has
two 16-capillary ABI sequencers.
The chromatograms (called "traces") are read by software. There
is generally software on the instrument, although the traces can also be
transferred to another computer for use with more elaborate base calling and assembly
software. The emission spectra of the dyes overlap. Generally
a preprocessing step is run on the instrument to subtract out the overlapping
signals and to normalize peak heights prior to display and base calling.
Since there are different dye sets in use by different manufacturers, the
software processing the traces must use parameters specific to the dye
set. Some of the dyes may alter mobility enough to require correction
of peak positioning.
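The overlap correction described above can be pictured as solving a small linear system at each scan point: the four raw channel readings are a mixture of the four dye signals. The following is a minimal sketch in Python, not the instrument's actual algorithm. The matrix `CROSSTALK` and all of its values are invented for illustration (a real matrix is calibrated for the specific dye set), and the function names are hypothetical.

```python
# Hypothetical cross-talk matrix: CROSSTALK[channel][dye] is the fraction of a
# dye's signal that appears in a detection channel. Values are invented.
DYES = ["A", "C", "G", "T"]
CROSSTALK = [
    [1.00, 0.20, 0.05, 0.00],
    [0.15, 1.00, 0.25, 0.05],
    [0.00, 0.20, 1.00, 0.30],
    [0.00, 0.05, 0.20, 1.00],
]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("cross-talk matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def mix(dye_signal):
    """Forward model: raw channel intensities produced by a given dye signal."""
    return [sum(CROSSTALK[ch][d] * dye_signal[d] for d in range(4))
            for ch in range(4)]

def unmix(raw_channels):
    """Recover the per-dye signal from four raw channel readings at one scan point."""
    return dict(zip(DYES, solve(CROSSTALK, raw_channels)))

# A pure A peak leaks into a neighbouring channel; unmixing removes the leak.
raw = mix([100.0, 0.0, 0.0, 0.0])
corrected = unmix(raw)
```

If the matrix used does not match the dye set, the subtraction is imperfect, which is one way a small residual peak of another color can remain under a true peak.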
An example of a small section of a 4 color sequence chromatogram:
From this figure notice that:
-
The peaks are well separated and there is only one color at each position.
In reality, the fluorescent emissions of the 4 dyes overlap. The
software has preprocessed the data to remove these overlaps. Otherwise
you might see that, for example, the green A peaks consistently had a substantial
orange G peak underneath them. Depending on how well the particular
software and dye set have been matched, a residual trace of this effect may
still be apparent in the processed data; it is not a cause for concern.
-
The spacing is not completely regular. This is usually due to secondary
structure in the oligonucleotides as they run through the capillary.
The effect is often called "compression" because when it is severe it causes
2 or more peaks to be compressed into a single position on the chromatogram.
In this case the effect is very slight. However, the software has indicated,
by listing some bases in lower case and by shading the base call on the
2nd line, that there is an increased likelihood of an error hiding in those
places in the sequence where the spacing is irregular. In this case the
risk is that the first G is really a doublet.
-
This particular display comes from the program Consed, which is the graphical
interface of the phredPhrap software system. The lower line of
colored letters contains the original automatic base calls. The middle
line also reflects those base calls, although it may be edited by the user.
The upper line is the consensus of multiple reads in this area. From
the shading of the upper line, one can tell that the software has called
the consensus highly reliable considering the quality of all reads in this
area. The logic of this system is to compile reads until the program
declares the consensus to be high quality at all positions. One is
supposed to avoid spending much time examining the individual reads, and
to avoid second-guessing the automated calls and altering
the sequence by visual inspection.
-
The numbering system at the very top is the numbering of the consensus
sequence. The numbering system below that is the numbering of the
individual read. Prior to assembly, the software identified the first
41 bases of the read as vector sequence, and converted those calls to "X".
-
The web site for phredPhrap is (http://bozeman.mbt.washington.edu/index.html).
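The vector masking mentioned in the list above can be illustrated with a few lines of Python. This is a minimal sketch that assumes the vector boundary is already known; how the software locates that boundary is not shown. The function name `mask_vector` and the example read are hypothetical.

```python
def mask_vector(read, vector_length):
    """Replace the first vector_length base calls with 'X'.

    The read keeps its original length, so read numbering is unchanged.
    """
    if vector_length < 0:
        raise ValueError("vector_length must be non-negative")
    n = min(vector_length, len(read))
    return "X" * n + read[n:]

read = "GATTC" * 9 + "ACGTACGTAC"   # made-up 55-base read
masked = mask_vector(read, 41)      # first 41 calls become X
```

Masking rather than deleting the vector bases keeps the position numbers of the read aligned with its trace.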
An example of a bad compression
The figure below shows a severe compression. This sequence was
obtained with dGTP instead of dITP as the reaction substrate.
From this figure note:
-
The reading is from right to left. The complement of the actual reading
is listed.
-
The characteristics of the compression.
-
There are too many peaks compressed at 54..57.
-
There is a decompression following at 58.
-
There are plausible complementary GC-rich sequences at 51 and at 46.
-
From the consensus, which was judged to be reliable based on other readings,
we see that one G has been dragged out of position by 2 nucleotides.
Also the C peak is actually a doublet of two bases that are not adjacent
in the sequence.
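Because this read is displayed as the complement of the actual reading, and runs from right to left, it can help to convert between the two orientations when comparing it with the consensus. A minimal sketch in Python; the function name and the example sequence are illustrative only.

```python
# Translation table for the four bases plus N, in both cases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the opposite strand of seq, written in the conventional direction."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("GGCCTA"))  # TAGGCC
```

Characters outside the table, such as the "X" used for masked vector, pass through unchanged.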
The following is the same template sequenced with dITP instead of dGTP.
Note that all of the compression problems have been resolved.
Band drop out.
Sequencing with dITP has its own artifact. Some peaks are reduced
in intensity due to pyrophosphorolysis.
Notice that the 2nd T in two instances of a TT is reduced to about 10%
of the intensity of the first T. The excellent sensitivity of the
sensor, and the fact that thermal cycling removes the early rounds of product
from the template before pyrophosphate builds up, generally allow the automated
sequencers to detect these reduced bands more reliably than autoradiographic
detection does. However, in the more compressed regions at the
end of the read, these bands can be omitted from the called sequence.
Dye blobs.
With dye terminator sequencing, there is an artifact caused by residual
unincorporated dye-labelled compounds remaining in the sample and making
broad peaks at various points in the chromatogram. It is necessary
to purify the sample after the thermal cycling reaction to remove the unincorporated
terminators, or else this effect badly obscures the chromatogram.
However, even with purification, interfering blobs of dye may appear in
the chromatogram and mislead the automated base caller. An example
is shown below.
Notice from this chromatogram:
-
There is a broad band of orange fluorescence spanning from 27..35
(the orange is a false color for the purpose of display).
-
The base caller miscalled the region as a run of G's.
-
The disruption in the profile has caused the entire region to be declared
low quality (indicated by lack of highlighting).
-
In this case, the operator has overruled the automated caller and edited
the miscalled bases (indicated by the highlighted bases on the middle line
of the sequence).
The dye blob problem tends to be worst near the beginning of the read.
This sample was purified by ethanol precipitation, which is cheap but not
the most effective. Alternative procedures with spin columns tend
to be more effective at removing the dye blobs. For high throughput,
there are 96-well formats for conducting the spin column purification.
Inclusion of extended data at reduced quality.
As with autoradiographic data, one has to decide whether to cut off reads
at the end of high quality data, or to include extended data of lower quality.
Data of lower quality imposes a great burden of manually resolving conflicts
between different reads. However, extended data can be useful in
melding together different reads, particularly in a shotgun strategy.
In the more sophisticated software, like the phredPhrap assembler,
the base caller automatically records a quality value for each base call
in the reading. The assembler can then use low quality data where that
is the only data available, but it can also automatically overrule low
quality data when high quality data is available.
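Phred-style quality values are defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong, so Q = 20 corresponds to a 1 in 100 chance of error. The sketch below shows this conversion, together with a deliberately simplified quality-weighted vote for one column of an alignment. The voting rule is an assumption made for illustration and is not the actual Phrap algorithm; the function names are hypothetical.

```python
import math

def error_probability(quality):
    """Convert a phred-style quality value to an estimated error probability."""
    return 10 ** (-quality / 10)

def quality_value(p_error):
    """Convert an estimated error probability to a phred-style quality value."""
    return -10 * math.log10(p_error)

def consensus_base(calls):
    """Choose a consensus base for one alignment column.

    calls is a list of (base, quality) pairs, one per read covering the column.
    Simplified rule: sum the qualities supporting each base and take the largest.
    """
    if not calls:
        raise ValueError("no reads cover this column")
    totals = {}
    for base, quality in calls:
        key = base.upper()
        totals[key] = totals.get(key, 0) + quality
    return max(totals, key=totals.get)

# Two low quality reads disagree with one high quality read.
column = [("a", 8), ("a", 10), ("G", 45)]
print(consensus_base(column))  # G
```

With a rule of this kind, low quality calls still decide the consensus where nothing better is available, but a single high quality call can outweigh several poor ones.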
In the figure below, the base caller has reduced the quality of
calls where it relied on spacing to resolve doublets and triplets.
The reduced quality is indicated by shading of the middle line of sequence
and by the use of lower case letters. The assembler has coded the
consensus as high quality indicating that in aggregate there is good data
at every position. Hence this system allows the inclusion of lower
quality data without requiring a great increase in human intervention.
In this chromatogram, resolution between adjacent peaks is being lost
around 200 nt into the read. One would normally hope to get at least
500 nt before losing resolution between adjacent peaks.
Note that compression would impose a great penalty on correctly calling
extended data because regular spacing becomes very important. Hence
one would always choose to use dITP or another analogue if extended data
is to be collected.
Assembly.
This is an assembly made by Phrap and displayed by Consed.
Notice from this figure:
-
The consensus is ranked as highly reliable at all positions.
-
The program has used the more reliable individual reads to overrule one
read that is completely unalignable.
An important function of this system is that it stores the traces along
with the aligned sequence. Clicking on any base in the alignment
will bring up the corresponding trace centered on the clicked base.
This tremendously speeds up human inspection of the quality of the data.
Sequencher.
The software employed by the UTHSCSA sequencing center is Sequencher.
This software aligns reads, and similarly can bring up the relevant trace
for examination upon clicking a base in the alignment. However, Sequencher
does not use quality values. It also relies on the base calling software
that is native to the instrument from which the data is ported. Without
the quality values, one is well advised to use a trimming function to remove
the low quality ends of reads before assembling the data.
Sequencher's web site is: (http://www.genecodes.com/).
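Without quality values, end trimming has to rely on the base calls themselves. The sketch below is one simple heuristic and is not Sequencher's actual trimming function: it trims each end of a read until it finds a window of calls containing no more than a set number of ambiguous bases ("N"). The window size, the threshold, and the function name are all assumptions chosen for illustration.

```python
def trim_ambiguous_ends(read, window=20, max_ambiguous=0):
    """Find the region of a read to keep after trimming ambiguous ends.

    Returns (start, end) so that read[start:end] is the retained region.
    Returns (0, 0) if no window of the requested size is clean enough.
    """
    def first_clean_window(seq):
        # Index of the first window with few enough ambiguous calls, or None.
        for start in range(0, len(seq) - window + 1):
            if seq[start:start + window].upper().count("N") <= max_ambiguous:
                return start
        return None

    left = first_clean_window(read)
    if left is None:
        return (0, 0)
    # The same search on the reversed read gives the amount to trim on the right.
    right = first_clean_window(read[::-1])
    return (left, len(read) - right)

read = "NNANNCNN" + "ACGT" * 10 + "NANNGNNN"   # made-up read with poor ends
start, end = trim_ambiguous_ends(read)
trimmed = read[start:end]
```

A stricter window removes more of each end, so the settings trade read length against the number of conflicts that have to be resolved by hand after assembly.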