ORF finding and start site refinement:
The history of phage genomics is full of additional genes
being recognized after the initial publication. When a gene is left out,
its protein sequence never appears in GenBank, so future investigators
lose the opportunity to match it in a Blast search and recognize
it as a conserved protein. False positive gene predictions, by contrast,
have no serious consequences. We therefore
use a fairly thorough prediction scheme to try to avoid leaving out
any plausible genes.
-
Collate results from multiple frame prediction programs utilizing:
-
Length of reading frame.
-
Codon usage.
-
Quality of ribosome binding signal.
-
Relation to adjacent frames.
-
Conservation of sequence with identifiable homologues.
-
Conservation of N terminus with identifiable homologues.
-
Spacing relative to identifiable promoters, transcription
terminators, or other identifiable noncoding features.
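As an illustration of the first criterion only (length of reading frame), a minimal sketch of a six-frame ORF scan follows. It is not one of the prediction programs we collate. The start codon set, the default length cutoff, and the function names are assumptions made for the example. For each stop codon it reports the most upstream start in frame, which leaves start site refinement to the other criteria in the list.

```python
# Minimal six-frame ORF scan. The start codon set and the default length
# cutoff are assumptions for illustration, not values from our pipeline.
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, n_codons) tuples for candidate ORFs.

    Coordinates are 0-based and end-exclusive on the strand being scanned.
    n_codons counts the start codon but not the stop codon. For each stop,
    only the most upstream in-frame start is reported.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i
                elif codon in STOPS:
                    if start is not None:
                        n_codons = (i - start) // 3
                        if n_codons >= min_codons:
                            orfs.append((strand, start, i + 3, n_codons))
                    start = None
    return orfs
```

A deliberately low cutoff fits the argument above: extra candidate frames cost little, and the remaining criteria can be used to discard the implausible ones.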
Gene identification (what kind of gene is it?):
-
UTHSCSA bioinformatics environment:
-
Psi-Blast.
-
Psi-Blast, which is easily run directly at NCBI, goes beyond
Blast in the case where one or more homologues are found but none of them
has an identified function. Psi-Blast constructs a key from the
first round finds, a position-specific scoring matrix (PSSM), that learns
from the residues in the close homologues which residues to allow at each
position. The PSSM can then be used to search again and see through deeper
divergence. Psi-Blast was able to form a PSSM from VpV262 orfG and SIO1
sequences and then see deeper to match a T7 gene.
-
Identification
of VpV262 orfG as maturase/packaging enzyme/terminase.
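To make the idea of a PSSM concrete, here is a toy sketch that builds a log-odds matrix from a few aligned sequences and scores a candidate against it. It is a simplification and not the Psi-Blast algorithm: it assumes an ungapped alignment, a uniform background of 1/20 per residue, and a fixed pseudocount, whereas Psi-Blast uses observed background frequencies, sequence weighting, and data-dependent pseudocounts. The function names and the example sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds PSSM (in bits) from equal-length aligned sequences.

    Assumptions for this sketch: no gaps, uniform background frequency,
    and the same pseudocount added to every residue at every position.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        column = [s[pos] for s in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum the per-position log-odds scores for a sequence of matching length."""
    return sum(col[aa] for col, aa in zip(pssm, seq))


# Hypothetical first-round finds, already aligned.
hits = ["MKL", "MKV", "MRL"]
pssm = build_pssm(hits)
```

Residues seen in the close homologues score above zero at their positions and unseen residues score below zero, so a divergent sequence that keeps the conserved positions can still outscore an unrelated one.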
-
Focused Blast
-
If one restricts the database searched, say to only viruses,
or to only a particular virus, the signal-to-noise ratio is greatly improved.
But the search is now biased: the more the database is restricted, the more
the reported E value incorporates
the assumption that there is a gene to find in the restricted set.
-
Note: in the VpV262 paper, we described how to do this directly
at NCBI. NCBI has since changed the operation of its site so that
this strategy does not work there. It can be executed with a local
implementation of the NCBI Blast Toolkit.
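The dependence of the E value on database size can be seen from the standard Karlin-Altschul relation, E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. The sketch below applies that relation with raw lengths; Blast itself uses effective lengths, so the numbers are illustrative only. The query length, bit score, and database sizes are hypothetical.

```python
def expected_hits(bit_score, query_len, db_len):
    """E value from the Karlin-Altschul relation E = m * n * 2**(-S').

    Uses raw lengths for simplicity; Blast applies edge-effect corrections
    (effective lengths), so treat the result as an approximation.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: the same alignment (same bit score) reported
# against a large general database and against a small restricted one.
full_db = expected_hits(bit_score=40, query_len=300, db_len=1e9)
restricted_db = expected_hits(bit_score=40, query_len=300, db_len=1e6)
ratio = full_db / restricted_db
```

The alignment is identical in both cases, yet the E value drops in direct proportion to the database size. That drop is the improved sensitivity, and also the bias: it is justified only to the extent that one had reason to look in the restricted set.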
-
Family database searches can be done at NCBI and at various
family databases.
-
These use a library of PSSMs (or related structures) previously
constructed from alignments of known protein families. These may
be able to see across deep divergence to recognize a gene in the new target
virus.
-
Reverse Psi-Blast.
-
One can construct family searches on the fly by keying a
Psi-Blast search with a suspected distant homologue, and allowing it to
form its own PSSM. It may then see across the divergence to pick
out a gene in the target virus in later iterations.
-
This requires a local implementation of Blast and the Blast
libraries so that the target virus genes can be incorporated into the library
(assuming the target virus is not yet submitted to GenBank).
-
Focused Reverse Psi-Blast
-
The library can also be restricted in the reverse Psi-Blast
strategy to increase sensitivity.
-
Rank order statistics
-
When using a severely restricted library, the Blast E value
is badly biased and becomes unreliable. The following example describes
a different statistic that may be used to validate the results of such
a search. It requires a prior hypothesis about which gene in a distant
virus is considered to be the likely homologue.
-
Identification
of VpV262 portal and capsid genes.
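One simple form a rank-based check could take is sketched below. It is an assumption for illustration and not necessarily the statistic used in the linked example. Given the scores of every gene in the restricted library and a gene named in advance as the likely homologue, it asks how often that gene would rank this high if ranks were assigned at random. With N genes and the named gene at rank r, that chance is r/N. Gene names and scores are hypothetical.

```python
def rank_p_value(scores, hypothesized):
    """Return (rank, p) for the gene named before the search.

    scores: dict mapping gene name -> search score (higher is better).
    rank is 1-based; ties are counted against the hypothesized gene.
    p = rank / N is the chance of a rank this good under random ordering.
    """
    target = scores[hypothesized]
    rank = sum(1 for s in scores.values() if s >= target)
    return rank, rank / len(scores)


# Hypothetical scores for a four-gene library.
scores = {"orf1": 12.0, "orf2": 35.5, "orf3": 9.1, "orf4": 20.0}
rank, p = rank_p_value(scores, "orf2")
```

The check is only meaningful when the hypothesis is fixed before looking at the scores, and it gains power with library size: ranking first among 4 genes gives p = 0.25, while ranking first among 60 gives p of about 0.017.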