Phrep-Phrap at UTHSCSA
Phrep-Phrap is a powerful system for the assembly of overlapping sequence
readings. The programs were obtained from the University of Washington
Genome Center. The home
page for Phred, Phrap, and Consed is at http://bozeman.mbt.washington.edu/index.html.
See Status page for any late arising changes
in the function of the UTHSCSA installation.
Phred is a base caller for four color sequence chromatograms
(found in an .ab1 or .scf file) that estimates quality values for each
base call. These are stored in a text file format called a .phd file.
Phrap is an assembler that takes the quality values into account
so that good data overrules bad data and computes an overall quality value
for each consensus base.
phredPhrap is a script that performs the following sequence of
actions:
-
Phred is called for each chromatogram unless a .phd file is already provided.
To use the base caller native to the sequencer instead of Phred, just the
.phd files directly. phredPhrap only invokes phred for chromatogram
files that don't already have a corresponding phd file.
-
The sequence reads are screened against vector sequence, which is masked.
-
Repetitive sequence (ie. Alu, etc.) can also be tagged. Note: the repeat
sequence library for repeat screening is not implemented. If you wish to
screen for repeats, contact Steve
Hardies
-
Phrap is called to assemble the data into one or more contigs, with marking
of segments that match to multiple places. The output is called an
.ace file.
-
Note: phredPhrap does not require graphics support.
Consed is an X-windows graphics viewer of the .ace file. Consed
features shading of individual and consensus sequences to indicate estimated
quality of each base. It gives one click access to the original chromatograms
at any specified alignment position. It allows tagging of individual
readings or the consensus with comments. It also supervises the addition
of more sequence reads to the project, and the recommendation of primers
for finishing. Finally, the main help file for the system is accessed from
the Consed window. The preferred method for accessing X-windows programs
on bcf from Microsoft windows platforms is VNC
viewer.
Setting up to use Phred Phrap.
-
You will need an account on bcf. If you do not have one, apply
for a bcf account here.
-
Make sure /home/hardies/apps/genome/bin is in your $PATH variable
-
eg. Put export PATH=/home/hardies/apps/genome:$PATH in your .bash_login
or .bash_profile.
-
(if .bash_profile is present in your home directory, it controls your environment
upon login. If .bash_profile is not present, then .bash_login controls
your environment upon login.)
-
If you encounter any reference to an implementation in /share/apps/genome
in our help pages, that is an anachronism. Instead look in the /home/hardies/apps/genome
directory for the sought after function.
-
Copy the file named vector.seq from /home/hardies/apps/genome/lib/
to your home directory.
-
Each user should customize this file by adding the vector sequences peculiar
to their sequencing project.
-
If the vector sequence for your clones is not present in vector.seq,
add it according to the format of the other sequences in the file.
Add the statement export CROSS_MATCH_VECTOR=<pathname>vector.seq
to your .bash_profile or .bash_login files.
Naming conventions and determineReadTypes.perl
-
As described in the documentation, the script phredPhrap would run another
script named determineReadTypes.perl which extracts information about the
direction of reads and library names from the phd filenames and edits the
phd files internally to make this information available for some of the
advanced functions of phrap and consed.
-
Since different users use different file naming conventions, this sets
up a situation where each user may want to customize their own version
of determineReadTypes.perl.
-
In our implementation, phredPhrap does not run determineReadTypes.perl.
Instead the user is expected to enter the phd directory of the project
and run determineReadTypes.perl with his/her own customized version of
the script first in path. determineReadTypes.perl requires no arguments.
It automatically detects all files in the phd directory that haven't yet
been processed and processes them.
-
These functions support assembly operations that use the expected insert
size of a given clone library to constrain the positioning of the right
and left end sequence reads from the same clone within the developing contig.
This is particularly helpful if sequencing a genome full of repetitive
sequences which will generally fail to properly asemble just by overlapping
sequence reads.
-
The version of determineReadTypes.perl in /home/hardies/apps/genome/bin
(which you will get by default) implements the naming conventions used
at UTHSCSA (referred inside the script as SCHformat):
-
The first two or more letters indicate the project and library, followed
by a clone number, followed by .r. or .f. indicating forward or reverse
primers, followed by other identifying information generally appended by
the sequencer controller. The .f. and .r. are case sensitive.
-
Several other conventions are described inside determineReadTypes.perl,
and may be substituted by changing the name of the subroutine called within
that script.
-
Or any user could implement their own naming conventions by writing and
embedding their own subroutine in their own copy of determineReadTypes.perl.
-
When using a customized version, it is important to either give it a different
name, or to make sure it is first in path. Otherwise, the version
in /home/hardies/genome/bin will run instead. If in doubt, the command
which determineReadTypes.perl will reveal which version is first
in path.
-
The information written by determineReadTypes.perl appears at the bottom
of each phd file processed, and can easily be inspected. Examining
it can be helpful when configuring parts of the system that use the information.
-
It is most convenient to match the convention enforced by determineReadTypes.perl
to the naming convention used during the set up of the sequencing runs.
However, when dealing with personnel that fail to follow a consistent naming
convention, we have had the occassion to write a filename translator to
retroactively force the file names into a coherent pattern. Since
the phd file (base calls and confidence values) refers to the corresponding
.scf file (the actual chromatogram) internally, the translator would have
to rename both files in concert, and change the reference inside the phd
file.
First time users.
Work through the "quick tour" in the Consed help file by the steps listed
below. Some functions of the system may require additional set up.
Contact Dr. Steve Hardies or Dr.
Borries Demeler for further assistance.
-
For each "project" (a set of sequences expected to be overlapped) you will
create a directory with 4 subdirectories by the names chromat_dir, chromats_to_add,
edit_dir,
and phd_dir.
-
cp -r /home/hardies/apps/genome/genome.work/standard $HOME/. will
create such a project named standard in your home directory with
the test data used by the "quick tour" section of the consed help system.
-
There are also test datasets for polyphred, assembly_view, and autofinish
in the genome.work directory. Polyphred and autofinish have never
been tested at this site, and may not be fully configured yet.
-
Polyphred deals with diagnosing heterozygous bases in a sequence chromoatogram.
-
Atuofinish deals with recommending clones and primers to do a directed
finishing of the sequence.
-
Set up your X-windows interface if you are accessing the system from a
remote computer. We recommend using VNC viewer. See help link
above.
-
Set the default directory to the edit_dir subdirectory of your standard
subdirectory.
-
Start consed by typing consed.
-
In one of the two windows that open, click on <help> and <show documentation>
-
Scan down the document to the quick tour segment.
-
Skip down to step 4 where it says to double click on <standard.fasta.screen.ace.1>
-
Proceed through the quick tour from there.
General steps for starting a new project.
-
Sort out file naming conventions to match a version of determineReadTypes.perl
and make sure the sequencing personnel understand it. In particular,
make sure they understand any case sensitive aspects of the convention.
-
Create a subdirectory for the project with the four subdirectories mentioned
in step one above.
-
Copy the chromatograms (.ab1 files, or .scf files) from the sequencer or
a computer reading a disk written at the sequencer by scp to the chromat_dir
subdirectory of the new project.
-
If you use base calls by a program other than phred, collect the
resulting .phd files and copy them by scp to the phd_dir subdirectory of
the new project. We have observed numbers of format problems with
.phd files output from other systems, and have a number of work-around
solutions to these problems. If you try this approach and have problems,
consult Dr. Steve Hardies.
-
Make sure the vector sequence for the clones is in the vector.seq file
mentioned in the Set Up section above.
-
Set the default directory to the edit_dir subdirectory of the project.
-
Type phredPhrap to call the reads and assemble contigs. [note the
odd capitalization pattern].
-
If working remote, make sure your X-windows interface is set up.
-
Type consed and click on the .ace file in the window that opens
to view your results.
Help on specific operations:
-
To get help in general: Access the help button from within consed.
-
The consed help is also found in a text file named README.txt in /home/hardies/genome/genome.work
-
There are more specific technical help files on phred and phrap there also,
named phred.doc and phrap.doc
-
To view chromatograms for "singlets".
-
Singlets are sequence reads that do not overlap other reads.
-
Singlets are not included in .ace files and not shown by consed.
-
Singlets are listed in the edit_dir directory in a file named <project>.fasta.screen.singlets
-
After running phredPhrap for the project, set the working directory to
the phd_dir subdirectory of the project. There should already be
a .phd file for the singlet.
-
phd2Ace.perl <singlet phd filename> will produce a file
named <singlet name>.ace in the phd_dir.
-
mv *.ace ../edit_dir/.
-
set the directory to edit_dir and run consed as usual. The singlet .ace
should show up as one of the options to open and view as usual.
-
To prevent phrap from overlapping reads due to a short chance match.
-
The parameter minmatch controls the minimum length of a match for phrap
to make an overlap.
-
Set minmatch to a value higher than the default during the phredPhrap run
as follows: phredPhrap -minmatch 20
-
Of course, this may prevent a true join with only 19 bases overlap from
being made. This becomes less of a problem as the project matures, since
there should be much more overlapping data than that everywhere.
-
Alternatives:
-
If the overlapping problem is really a stray bit of vector:
-
Check that you have correctly included all necessary sequence for vector
screening.
-
You can manually convert stray bits of vector (or any interfering sequence
that is of no value to the assembly) to n's in the trace view window.
During subsequent reassembly these bases will be ignored.
-
You can usually manually remove a read from a contig by right clicking
it and selecting the appropriate option. However, if the falsely
joined read extends the contig and adds to the consensus, consed will not
let you remove it.
-
If a gel removed to a new contig by the above operation really belongs
in another contig, you may use the compare operation on the display contig
window to manually align and join the two.
-
In the same menu, you may click "Tell phrap not to overlap reads discrepant
at this position", if there are high quality discrepancies to mark.
Then save assembly and rerun phredPhrap. This is described in the
consed help. Presumably you can elevate certain bases to high quality
using the trace display window if necessary to carry out this operation.
-
Ignore it until enough data that properly overlaps the read acrues to automatically
pull it to a correct position in the assembly. [preferred approach]
-
Change one or more bases in the overlap of the errant gel to an N in the
trace view window. This is a last resort before you throw the read
away altogether by deleting it from the phd_dir and chromat_dir directories
and rerunning phredPhrap. It would be better to only throw away one
base of data than an entire read.
-
Execution time
-
The slow step is the assembly by phrap.
-
With each new batch of reads, one has the option to maintain the existing
alignments and just add the new reads (an operation supervised by Consed),
or to reassemble all the reads together (done by running phredPhrap again).
For a large project, obviously adding the batches incrementally is more
efficient in execution time.
-
For large projects:
-
Note the manuever described in the Consed help concerning faster Consed
startup.
-
For more than 64,000 reads, a special version of phrap and cross_match
must be used. These may be found in the genome.work directory by
the names phrap.manyreads and cross_match.manyreads.
-
You would have to redirect phredPhrap and Consed to use these versions.
Note the implementation of the longreads alternative below to see how that
might be done.
-
Alternatively:
-
One or more contigs from a partial assembly may be presented to the next
assembly as if they were a each a single long gel read with the quality
values set to the consensus quality of the prior assembly.
-
This is particularly useful for integrating pyrosequencing data, which
has to be assembled by a different alignment program anyway. Typically
the pyrosequence data would come assembled into contigs from the contractor
with an asociated quality values file.
-
The program fastaq2phd, obtained from http://www.genome.ou.edu/informatics.html,
may be used to convert these contigs to individual phd files with quality
values embedded.
-
fastaq2phd is in /home/hardies/genome/bin, so it should be in your path.
-
The phd files thus created will not have associated chromatogram files.
For pyrosequencing data, there are no chromoatograms. If you represent
your own prior assemblies this way, you'd have to go back to the project
where that assembly was made and view the chromatograms there if you wanted
to see them.
-
To process datasets with long gel reads synthesized in this way, one needs
to access a specially compiled version of phrap for longreads.
-
In the /home/hardies/apps/genome system, starting phredPhrap with phredPhrap_longreads,
or starting consed with consed_longreads switches the longread version
of phrap into the path. Note, these are just wrappers that alter
the environment and put some symlinks into the path and then pass the argument
list off to the regular versions of phredPhrap and consed. You could
copy this arrangement to make a .manyreads implementation.
-
Repeat screening.
-
I'll add some documentation about this presently. SCH
Last updated 07/09/2008- Steve Hardies