Screen shots of 4 color sequence data in Consed.


Modern DNA sequencing is moving towards capillary sequencers.  A liquid polymer is used in place of a gel in a thin capillary.  New polymer is pumped into the capillary between reads.  Because the capillary is very thin, it dissipates heat well, so the voltage can be raised and the run time reduced.  Detection is by four color fluorescence.

Loading of the sample is by electrokinetic injection.  The tip of the capillary and an electrode are dipped by a robotic mechanism into a sample well in a tray (typically 96 wells), and the sample is loaded by electrophoresis.  Then the tip of the capillary is transferred to a buffer chamber.  The instrument can have 8, 16, or even more capillaries loaded in parallel.  The UTHSCSA sequencing facility has two 16-capillary ABI sequencers.

The chromatograms (called "traces") are read by software.  There is generally software on the instrument, although the traces can also be transferred to another computer for use with more elaborate base calling and assembly software.  The emission spectra of the dyes overlap.  Generally a preprocessing step is run on the instrument to subtract out the overlapping signals and to normalize peak heights prior to display and base calling.  Since there are different dye sets in use by different manufacturers, the software processing the traces must use parameters specific to the dye set.  Some of the dyes may alter mobility enough to require correction of peak positioning.
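The overlap subtraction described above can be pictured as solving a small linear system at each scan point.  The following is a minimal sketch, not the instrument's actual algorithm: it assumes the raw four-channel signal is a fixed linear mixture of the four dye signals, and the matrix values, the names CROSSTALK and unmix, and the A, C, G, T channel order are all hypothetical.  A real matrix would be specific to the dye set.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left unchanged.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("crosstalk matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


# Hypothetical crosstalk matrix: CROSSTALK[i][j] is the fraction of dye j's
# emission that is detected in channel i.  Assumed order: A, C, G, T.
CROSSTALK = [
    [1.00, 0.10, 0.30, 0.02],
    [0.08, 1.00, 0.05, 0.15],
    [0.25, 0.04, 1.00, 0.06],
    [0.01, 0.20, 0.05, 1.00],
]


def unmix(raw_scan):
    """Estimate the per-dye signal from one raw four-channel scan point."""
    return solve_linear(CROSSTALK, raw_scan)
```

Under these assumptions, a raw reading produced by the A dye alone comes back from unmix as signal in the A channel only, which is why the processed peaks show one color per position.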
 

An example of a small section of a 4 color sequence chromatogram:


From this figure notice that:

  1. The peaks are well separated and there is only one color at each position.  In reality, the fluorescent emissions of the 4 dyes overlap.  The software has preprocessed the data to remove these overlaps.  Otherwise you might see that, for example, the green A peaks consistently had a substantial orange G peak underneath them.  Depending on how well the particular software and dye set have been matched, such an effect may be apparent in the processed data, and is not a cause of concern.
  2. The spacing is not completely regular.  This is usually due to secondary structure in the oligonucleotides as they run through the capillary.  The effect is often called "compression" because when it is severe it causes 2 or more peaks to be compressed into a single position on the chromatogram.  In this case the effect is very slight; however, the software has indicated, by listing some bases in lower case and by shading the base call on the 2nd line, that there is an increased likelihood of an error hiding in those places in the sequence where the spacing is irregular.  In this case the risk is that the first G is really a doublet.
  3. This particular display comes from the program Consed, which is the graphic interface of the phredPhrap programming system.  The lower line of colored letters contains the original automatic base calls.  The middle line also reflects those base calls, although it may be edited by the user.  The upper line is the consensus of multiple reads in this area.  From the shading of the upper line, one can tell that the software has called the consensus highly reliable considering the quality of all reads in this area.  The logic of this system is to compile reads until the program declares the consensus to be high quality at all positions.  One is supposed to avoid spending much time examining the individual reads, and one is supposed to avoid second-guessing the automated calls and altering the sequence by visual inspection.
  4. The numbering system at the very top is the numbering of the consensus sequence.  The numbering system below that is the numbering of the individual read.  Prior to assembly, the software identified the first 41 bases of the read as vector sequence, and converted those calls to "X".
  5. The web site for phredPhrap is ( http://bozeman.mbt.washington.edu/index.html ).


An example of a bad compression

The figure below shows a terrible compression.  This sequence was obtained with dGTP instead of dITP as the reaction substrate.


 

From this figure note:

  1. The reading is from right to left.  The complement of the actual reading is listed.
  2. The characteristics of the compression.
    1. There are too many peaks compressed at 54..57.
    2. There is a decompression following at 58.
    3. There are plausible complementary GC rich sequences at 51, and at 46.
  3. From the consensus, which was judged to be reliable based on other readings, we see that one G has been dragged out of position by 2 nucleotides.  Also the C peak is actually a doublet of two bases that are not adjacent in the sequence.

The following is the same template sequenced with dITP instead of dGTP.

Repeat of the compressed reading with dITP showing that all compression artifacts have been resolved.

Note that all of the compression problems have been resolved.
 

Band drop out.

Sequencing with dITP has its own artifact.  Some peaks are reduced in intensity due to pyrophosphorolysis.

4 color trace showing reduced intensity of peaks at two positions.

Notice that the 2nd T in two instances of a TT is reduced to about 10% of the intensity of the first T.  Automated sequencers are generally better than autoradiographic detection at picking up these reduced bands, for two reasons: the sensor is very sensitive, and thermal cycling removes the early rounds of product from the template before pyrophosphate builds up.  However, in the more compressed regions at the end of the read, these bands can be omitted from the called sequence.
 

Dye blobs.

With dye terminator sequencing, there is an artifact caused by residual unincorporated dye-labelled compounds remaining in the sample and making broad peaks at various points in the chromatogram.  It is necessary to purify the sample after the thermal cycling reaction to remove the unincorporated terminators, or else this effect badly obscures the chromatogram.  However, even with purification, interfering blobs of dye may appear in the chromatogram and mislead the automated base caller.  An example is shown below.

Notice from this chromatogram:

  1. There is a broad band of orange fluorescence spanning from 27..35.  (The orange is a false color for the purpose of display).
  2. The base caller miscalled the region as a run of G's.
  3. The disruption in the profile has caused the entire region to be declared low quality (indicated by lack of highlighting).
  4. In this case, the operator has overruled the automated caller and edited the miscalled bases (indicated by the highlighted bases on the middle line of the sequence).

The dye blob problem tends to be worst near the beginning of the read.  This sample was purified by ethanol precipitation, which is cheap but not the most effective method.  Alternative procedures with spin columns tend to be more effective at removing the dye blobs.  For high throughput, there are 96 well formats for conducting the spin column purification.
 

Inclusion of extended data at reduced quality.

As with autoradiographic data, one has to decide whether to cut off reads at the end of high quality data, or to include extended data of lower quality.  Data of lower quality imposes a great burden of manually resolving conflicts between different reads.  However, extended data can be useful in melding together different reads, particularly in a shotgun strategy.

In more sophisticated software, such as the phredPhrap assembler, the base caller automatically records a quality value for each base call in the reading.  The assembler can then use low quality data where that is the only data available, but it can also automatically overrule low quality data when high quality data is available.
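As a rough illustration of what a per-base quality value carries, the sketch below uses the phred-scaled definition, in which a quality Q corresponds to an estimated error probability of 10^(-Q/10).  The function names and the display threshold of 20 are assumptions made for this example; they are not taken from phred or Consed.

```python
def error_probability(quality):
    """Convert a phred-scaled quality value to an estimated error probability."""
    return 10 ** (-quality / 10)


def display_calls(bases, qualities, threshold=20):
    """Show calls below the threshold in lower case and the rest in upper case.

    The threshold of 20 is a hypothetical choice for illustration only.
    """
    return "".join(
        base.upper() if quality >= threshold else base.lower()
        for base, quality in zip(bases, qualities)
    )
```

For example, a quality of 20 corresponds to an estimated 1 in 100 chance that the call is wrong, and a quality of 30 to 1 in 1000.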

In the figure below, the base caller has reduced the quality of calls where it relied on spacing to resolve doublets and triplets.  The reduced quality is indicated by shading of the middle line of sequence and by the use of lower case letters.  The assembler has coded the consensus as high quality, indicating that in aggregate there is good data at every position.  Hence this system allows the inclusion of lower quality data without requiring a great increase in human intervention.
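The idea that high quality calls can overrule low quality ones can be sketched with a deliberately simple rule: at each aligned position, take the call with the highest quality value.  This is an illustrative simplification and is not Phrap's actual consensus algorithm; the function names and the input layout (one list of (base, quality) pairs per alignment column) are hypothetical.

```python
def consensus_column(calls):
    """Pick the base from the highest-quality call in one alignment column.

    `calls` is a list of (base, quality) pairs, one per read covering the column.
    """
    if not calls:
        # No read covers this column, so nothing can be called.
        return "N", 0
    base, quality = max(calls, key=lambda call: call[1])
    return base.upper(), quality


def consensus(columns):
    """Build a consensus string and a per-base quality list from aligned columns."""
    picked = [consensus_column(column) for column in columns]
    return "".join(base for base, _ in picked), [quality for _, quality in picked]
```

With this rule, a low quality call is used only where no better read covers the position, so extended low quality data never outvotes a good read.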

A 4 color chromatogram showing the effect of loss of resolution on quality values.

In this chromatogram, resolution between adjacent peaks is being lost around 200 nt into the read.  One would normally hope to get at least 500 nt before losing resolution between adjacent peaks.

Note that compression would impose a great penalty on correctly calling extended data because regular spacing becomes very important.  Hence one would always choose to use dITP or another analogue if extended data is to be collected.
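To make the role of regular spacing concrete, the following sketch flags gaps between adjacent peaks that differ markedly from the typical gap, which is the pattern left by a compression followed by a decompression.  It is only an illustration under stated assumptions: peak positions are taken as already detected, the function name is hypothetical, and the tolerance of 35% of the median gap is an arbitrary value, not one used by any base caller.

```python
from statistics import median


def irregular_gaps(peak_positions, tolerance=0.35):
    """Return indices of gaps that deviate from the median peak spacing.

    Gap i lies between peak i and peak i + 1.  The tolerance is a fraction of
    the median gap and is a hypothetical value chosen for illustration.
    """
    gaps = [b - a for a, b in zip(peak_positions, peak_positions[1:])]
    if not gaps:
        return []
    typical = median(gaps)
    return [
        index
        for index, gap in enumerate(gaps)
        if abs(gap - typical) > tolerance * typical
    ]
```

Late in a read the peaks broaden and overlap, so a base caller has little besides spacing to separate doublets from singlets, and any irregularity of this kind becomes costly.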
 

Assembly.

This is an assembly made by Phrap and displayed by Consed.

Notice from this figure:

  1. The consensus is ranked as highly reliable at all positions.
  2. The program has used the more reliable individual reads to overrule one read that is completely unalignable.

An important function of this system is that it stores the traces along with the aligned sequence.  Clicking on any base in the alignment will bring up the corresponding trace centered on the clicked base.  This tremendously speeds up human inspection of the quality of the data.

Sequencher.

The software employed by the UTHSCSA sequencing center is Sequencher.  This software aligns reads, and similarly can bring up the relevant trace for examination when a base in the alignment is clicked.  However, Sequencher does not use quality values.  It also relies on the base calling software that is native to the instrument from which the data is ported.  Without the quality values, one is well advised to use a trimming function to remove the low quality ends of reads before assembling the data.
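When no quality values are available, one common proxy for low quality is the density of ambiguous calls such as N.  The sketch below trims each end of a read until the terminal window contains no more than a set number of ambiguous calls.  It is a hypothetical criterion written for illustration and is not Sequencher's trimming function; the function name, window size, and limit are all assumptions.

```python
def trim_ends(read, window=25, max_ambiguous=2):
    """Trim both ends until each terminal window has few ambiguous calls.

    Returns (start, end) such that read[start:end] is the retained region.
    The window size and limit are hypothetical values for illustration.
    """
    def ambiguous(segment):
        # Count every call that is not an unambiguous A, C, G, or T.
        return sum(1 for base in segment if base.upper() not in "ACGT")

    start, end = 0, len(read)
    # Advance the 5' boundary while the leading window is too ambiguous.
    while end - start >= window and ambiguous(read[start:start + window]) > max_ambiguous:
        start += 1
    # Retreat the 3' boundary while the trailing window is too ambiguous.
    while end - start >= window and ambiguous(read[end - window:end]) > max_ambiguous:
        end -= 1
    return start, end
```

A stricter limit removes more of each end.  With max_ambiguous=0, trimming continues until the first and last windows contain no ambiguous calls at all.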

Sequencher's web site is: (http://www.genecodes.com/).