Objectives:
The term "bioinformatics" is used by different people to mean completely different things. It became popular after the human genome project promoted dumping lots of unanalyzed data into public databases, causing the problems of gathering the data and using the data to fall on different people having different skills. Hence for many, "bioinformatics" means knowing how to dig information out of public databases, or how to run programs that analyze such data, or knowing how to write programs that analyze data, or knowing how to create web sites that run programs for other people to analyzie such data, or knowing how to create more databases with web interfaces that have results of analyzing such data, etc. The term has been co-opted to include anything from medline searching, to maintaining hospital records, to teaching classes over the internet.
Nature of sequence databases
Protein sequence databases
Early in the development of sequence databases, any DNA or protein sequences were accepted by one of several entities that were collecting data and an identifier was assigned. When the different databases began exchanging data, this resulted in multiple cases where the same name was given to different sequences, and many cases of multiple entries of the same sequences with different names. Eventually an additional name called an accession number (which always begins with some letters) was assigned to try to make sure each sequence has a different name. Later, yet again another number (the gi number, which is always a pure number) was assigned in an attempt to give a unique number to each sequence. For proteins, even more curated databases sprung up that assigned yet again more identifiers (and gi numbers) to each curated sequence. The sequence centers that hosted search engines wanted to reduce the redundancy to reduce search times. They used computer searches to identify all cases where sequences were identical in length and sequence and reduced them to one copy. These were called "nonredundant" databases. However, the enhanced documentation in the curated databases is too useful to discard. In its current form nonredundant databases consist of a single sequence of each redundant set linked to multiple sets of documentation.
Recently the protein database at NCBI has been split into divisions. See divisions in NCBI databases.
An example of the redundant entries in the protein nr database retrieved by a Blast Search at NCBI (in 2003):
>gi|131791|sp|P22750|RB4B_HUMAN Ras-related protein Rab-4B
gi|108108|pir||F36364 GTP-binding protein rab4b - dog
gi|919|emb|CAA39800.1| rab4b [Canis familiaris]
gi|5640004|gb|AAD45923.1|AF165522_1 ras-related GTP-binding
protein 4b [Homo sapiens]
gi|20379046|gb|AAM21083.1|AF498935_1 small GTP binding protein
RAB4B [Homo sapiens]
gi|28422140|gb|AAH46927.1| Unknown (protein for MGC:52123)
[Homo sapiens]
In 2005 the same search produces:
>gi|54792729|ref|NP_001003275.1| rab4b GTP-binding protein [Canis
familiaris]
gi|28422140|gb|AAH46927.1| RAB4B protein [Homo sapiens]
gi|20379046|gb|AAM21083.1| small GTP binding protein RAB4B
[Homo sapiens]
gi|919|emb|CAA39800.1| rab4b [Canis familiaris]
gi|46577635|sp|P61018|RB4B_HUMAN Ras-related protein Rab-4B
gi|46577634|sp|P61017|RB4B_CANFA Ras-related protein Rab-4B
gi|5640004|gb|AAD45923.1| ras-related GTP-binding protein
4b [Homo sapiens]
gi|108108|pir||F36364 GTP-binding protein rab4b - dog
Anything that makes the sequence different causes a different entry, even if it is logically the same protein. Examples are mutants (natural or artificial), polymorphisms, fragments, alternatively spliced variants, sequencing errors, or misdiagnosed start codons. On the other hand, a protein of the same sequence isolated from two different species would be categorized as redundant. It would appear from the entries above that human and dog rab4b have identical sequences.
As entire genomes were sequenced, the curated databases tried to incorporate one complete set of proteins. However, keeping in mind that many of the genes are predicted, different curated databases may not completely agree on what to call genes in a given genome. NCBI began its own curated division called RefSeq.
Note in the example above, that by 2005 a RefSeq entry has appeared for this sequence (gi|54792729|ref...). The SwissProt database entry (...|sp|...) has been replaced by two entries with identical sequence, one attributed to humans and one to dogs. Curiously, NCBI's curators didn't mention anything about human rab4b in their entry, but by looking down the list of blast matches a ref entry for human rab4b appears that has a completely different 40 residue N terminus, but otherwise is identical to the sequence above. This could have something to do with alternative splicing, or a difference in opinion among different database curators about where the start codon might be. The citation given in the human RefSeq entry is to a paper about a collection of 15,000 "full length cDNAs". This is unlikely to have any specific information about human rab4b, but does imply that someone isolated a cDNA that seemed to encode the extra 40 residues. A cursory search of medline did not reveal commentary about alternative splicing. Further commentary on defining the struture of this gene found in the tutorial for researching human gene structure posted in the course bioinformatics tutorial page.
Nucleotide Databases
Nucleotide databases originally developed along the same lines, but then some kinds of data became segregated so as to only be searched separately. ESTs and STSs were given their own divisions. When sequencing of whole genomes began, the unfinished data was kept separate. For some genomes, the unfinished data is kept at the sequencing center and only finished or near-finished data is deposited into the public nucleotide databases. Some private enterprises never deposited their genomes in the public databases. Public sequencing centers were provided a separate division to deposit unfinished data into, called htgs, and then later another called wgs. For unfinished genomes, if is often better to find a genome center which may provide a coherent partial assembly, called a "draft" or "freeze" for blast searching.
For more on the divisions of the NCBI databases see divisions
in the local version of NCBI's sequence databases.
The basic Blast suite of programs:
BLASTP: compares amino acid query to protein sequence database
BLASTN: compares DNA query to DNA database
BLASTX: translates DNA query to all 6 frames and searches a protein
database.
TBLASTN: compares an amino acid query against all six frames of translated
DNA database entries.
TBLASTX: compares all 6 frames of a query against all six frames of
a DNA database.
Other search programs:
The BLAST family replaces a program named FASTA (Lipman and Pearson (1985) Science 227, 1435-1441.) Whereas FASTA produced a score for the goodness of the match, but didn't normalize the score for the size of the database. So the minimum score required for confidence in the match kept changing as the databases grew. Blast automatically normalizes its score to the size of the database. Hence it reports an E value, which is the expectation for the number of time a sequence matching as well as the match under examination should be found by random chance. Further advancements were to include postprocessing to improve gap placement.
A major advance was the introduction of Psi-Blast, which automatically compiles a profile (called a Position-Specific Sequence Matrix or PSSM) from the first round of BlastP matches and uses it to search for more divergent homologues.
Dynamic programming alignment methods.eg. Needleman-Wunsch Alignment and many variations thereafter. (Needleman
and Wunsch, J. Mol. Biol. 48, 443 (1970)).
This and related algorithms find an optimum alignment between two sequences
in the following mathematical sense:
Dotplotters
Programs that make graphical displays of wherever seq. 1 matches at x/y positions to a position in seq. 2. The distinction between signal and noise is left up to your visual perception of a trace of dots in a line in the midst of a random pattern of dots. These programs are very helpful in displaying internal repetitiveness within and between sequences. They are generally only useful for comparing 2 relatively short sequences. NCBI's blast 2 sequence program returns a small dotplot.
This is a comparison from NCBI of a mRNA containing a repetitive sequence with its chromosomal segment.
A web site for making dotplots is http://athena.bioc.uvic.ca/pbr/jdotter/browse.html. You have to have Java installed on your computer, and you apparently have to strip all headers, spaces, numbers, and carriage returns out of your sequence before you paste it in the box.
Screening for repetitive DNA.
If your sequence query contains a repetitive DNA, the long list of matches
to that portion may obscure a more subtle match to the nonrepetitive component
of the sequence. A web site designed to help you identify and remove repetitive
sequences is http://ftp.genome.washington.edu/cgi-bin/RepeatMasker.
Detection of features by comparison of multiple sequences.
P = 14% of finding a 9/9 chance match between any two sequences within 200 bp of the TATA boxes of two different promoters. However finding the same sequence in a 3rd promoter dramatically enhances confidence.
Also, limiting the search area to the sequence underneath a footprint dramatically enhances the significance of a sequence match. For example, here are the sequences found underneath the footprints of the zeste protein on the bithorax promoter:
Binding site Orientation Sequence
Z5 Antisense tctcttTGAGTGttcg
Z4 Antisense gttttTGAGCGctct
Z3 Sense tttTGAGCGtttgcg
Z2 Antisense taaaaaCGAGTGgaaaac
Z1 Sense CGAGTGAGTTGAGTc
Consensus
YGAGYG
From Biggin et al. (1988) Cell 53, 713-722.
Two kinds of evolution create sequence similarity.
Similarity between two sequences can be generated by two evolutionary processes (besides random chance):
1. Divergent evolution - descent from a common ancestor. Sequences that are similar for this reason are said to be homologous. This is the kind of evolution that relates proteins or protein domains into families and superfamilies. Homologues can arise as the same gene descends into different species of organism, or because of a gene duplication allowing two versions of a gene to diverge within the same organism. Genes related by speciation but not by duplication are called "orthologues". Genes separated by a duplication are called "paralogues". Paralogues remaining in the same genome carry the special implication that they have functionally specialized. Truly redundant genes (meaning that one of them can be lost without affecting the fitness of the organism) do not last in evolution. However, the specialized contribution of each gene could be subtle, and may be confined to the expression profile rather than to the enzymatic properties of the protein.
During divergent evolution, different residues have a different propensity to change over time. Even protein sequences that have lost their function (called pseudogenes) take a certain amount of time to lose similarity. The extent to which functioning genes retain similarity longer than pseudogenes do is called "conservation". The idea behind defining a motif from a multiple sequence alignment is to identify residues that are not changing among multiple independent comparisons, and hence have predictive power. In other words, you should reach a point where adding additional family members does not change the set of conserved residues. However, there are very few truly invariant residues in proteins. So eventually one discovers even more diverged family members that have a more relaxed version of the motif. If a distantly related protein matches the motif due to conservation rather than chance, then closer relatives of that protein will also retain the key residues of the motif.
Measuring the distance between divergent protein sequences such that it would be proportional to the evolutionary time passed is a difficult problem. Ignoring the possibility that the intrinsic mutational process may not have been constant in all lineages at all times, there is the more fundamental problem that as divergence passes, changes fall on top of previous changes and cease to be counted as additional differences. There are a number of mathematical corrections that are applied to the problem, but they all have increasing uncertainty at high divergence. Difficulty in relating divergence to time passed often expresses itself as uncertainty about the root of a tree.
Divergent evolution often proceeds to the point where there is no detectable sequence similarity, but the 3-D structures are still very similar. However, recent improvements in the use of profiles and Hidden Markov Models is bringing recognition of divergent sequence similarity back into the range of fold recognition.
2. Convergent evolution - two different sequences become similar because they are selected for a common function.
Similarity between promoters or transcription factor binding sites is usually convergent evolution. So one shouldn't call them homologous unless you really mean that they arose from a gene duplication. The relatedness among different alpha helices enforced by the preference for some residues over others in a helix is another example of convergent evolution.
There are usually not definable subsets of convergently evolved motifs. It may be true that better adherence to the consensus motif means stronger function, but this is not necessarily true.
Representation of motifs
1. Most simply as a pattern or signature, like TATA for the TATA box. Search programs will allow ambiguity characters and a provision for gaps of various lengths.
This method is used for restriction site searching. There is no standardized format. Comprehensive datafiles of restriction enzyme cleavage sites formatted for many different programs are found at http://rebase.neb.com/. One can get restriction site searches done at web sites. eg. http://rna.lundberg.gu.se/cutter2/
Protein motifs and transcription factor binding sites were originally defined in this way. There is a formalized syntax specified within a protein family database called PROSITE. The rules are defined at http://www.expasy.org/tools/scanprosite/scanprosite-doc.html.
(Hoffmann K, Bucher P, Falquet L, Bairoch A. 1999. The prosite database,
its status in 1999, Nucl. Acids Res. 27: 215-219; http://www.expasy.org/prosite/).
Here is an example of Zinc finger motifs:

Prosite's version of the zinc finger is broken into 5 subclasses:
Zinc finger, C2H2 type
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
Zinc finger, C3HC4 type (RING finger):
C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]
GATA-type zinc finger
C-x-[DN]-C-x(4,5)-[ST]-x(2)-W-[HR]-[RK]-x(3)-[GN]-x(3,4)-C- N-[AS]-C
Poly(ADP-ribose) polymerase zinc finger
C-[KR]-x-C-x(3)-I-x-K-x(3)-[RG]-x(16,18)-W-[FYH]-H-x(2)-C
Prokaryotic dksA/traR C4-type zinc finger.
C-[DES]-x-C-x(3)-I-x(3)-R-x(4)-P-x(4)-C-x(2)-C
Lots of programs use this format including MOTIFS in the GCG package. Several web sites will search against this database including http://www.expasy.org/tools/scanprosite/. See PHI-Blast and regular expression searching.
2. As a weight matrix. An individual matching matrix is defined for each position in the substring. The scores in the matrix are usually the relative frequency of finding each possible residue in that position in a large set of family members sharing the motif.
For example:
Frequencies of different bases out of 112 procaryotic promoters.
----------------------------------------------------------------------------
Pos. -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5
T 35 28 28 27 39 51 34 43 26 31 89 3 49 15 19 108 31 29 21
C 34 21 24 27 12 25 20 25 20 27 10 2 16 14 22 3 13 16 30
A 20 39 33 33 39 23 29 16 23 19 2 106 29 66 57 1 35 23 31
G 23 24 27 25 22 13 29 28 43 35 11 1 18 17 14 0 33 24 30
----------------------------------------------------------------------------
Ref: R. Staden (1984) Nucl. Acids Res. 12, 505-519.
TRANSFAC is a transcription factor database that searches for binding sites by a matrix driven approach. See http://www.gene-regulation.com/ for TRANSFAC database and several search programs.
3. As a profile: A weight matrix but with individualized gap penalties for each position in the substring. Many protein motifs and DNA motifs have been converted into the profile form.
Profiles have evolved into Hidden Markov Models. See Comparison of Profile-Based Methods.
Neural Nets
A new interesting kind of computer system for pattern recognition is called a neural net (because it's supposed to simulate how your brain works). It consists of several layers of weighting tables separating the input (eg. amino acid sequence) from the output (eg. a decision on whether the sequence is alpha helix, beta sheet, beta turn, etc).
In the diagram above, each amino acid in the sequence is represented by a number which is transmitted to the first layer of nodes in the network. Each node has a set of weighting factors that it uses to sum up the input and compute its own internal value which is transmitted to the next layer of nodes. Here, another set of weighting factors is used to compute the second layer values for transmission to the final layer. Here, after application of a final set of weighting factors, the result should indicate the likelihood of that particular outcome.
The key is to get all of the weighting factors set so that each possible amino acid sequence is translated to its correct result. The program sets the weights itself during a training session where it is presented with a variety of sequences for which the output is known. Neural nets are applicable to a wide variety of pattern recognition functions. They are particularly effect at recognizing fuzzy, partial, or obscured patterns. Notice that the "pattern" is never formulated as a discrete rule. Examples of neural net recognition programs are:
GRAIL (Genomics 24:133-6, 1994). Searches for exons.
Open reading frames.
Most simply long open reading frames might be considered significant based on:
P = 1-[1-e-Lp]2n, where P is probability that
it's a chance match, L is length of open frame observed in codons, n is
number of nucleotides examined, and p is the probability of a terminator
occurring at any specific position by chance.
Expectation is E=2n(e-Lp).
Essentially the above approach accumulates a fixed weight for each consecutive codon that is not a terminator. Improvements on this can be made by the use of prior information about how triplets are distributed in known genes of this organism. So a higher weight can be given based on triplets that conform to the usual codon preference of the organism, and lower weights can be given to triplets that tend to accumulate in noncoding sequences. The method of readjusting the probability for a frame n long by figuring in the weight of another codon to make it n+1 long is mathematically complicated, and goes under the name Bayesian Statistics. The weight table is called a Hidden Markov Matrix, or Hidden Markov Model (HMM). HMM methods may improve further by adding weight depending on how well the ends of the frame conform to the norms for translation initiation signals or splice sites for that organism.
A problem with statistical methods is that genes vary nonrandomly from the norm. For example, weakly expressed genes may both exhibit little codon preference and have poorly recognizable control signals.
Other kinds of data that may be included in predicting frames includes sequence similarity to other known proteins, conservation of sequence to related organisms in a way that preserves the open frame, the existence of EST clones in the region, and the relation to gene predictions in other frames.
Biochemical testing of the meaning of an open reading frame
Biochemical testing of whether open reading frames encode a protein can be done by synthesizing peptides corresponding to the putative encoded protein, making antibodies, and using the antibodies to search for the protein.
Ref: Mariottini et al., PNAS 83, 1563-1567.
Northern Blotting (or inclusion in a microarray) can be used to test if a sequence is expressed, although to some extent pseudogenes and intergenic DNA can be transcribed.
Evaluating confidence in a computational result.
1. Overall similarity of > 30% at the amino acid level over the length of a typical protein is a clear indication of homology. Greater than 70% similarity over hundreds of base pairs of DNA sequence is a clear indication of homology. For short matches or matches of lower quality, the amount of sequence searched has a tremendous impact on the likelihood of chance finds. Generally the probability and expectation of chance matches grows with the n characters searched against m characters as follows.
P(nm)= 1-[1-P(1)]nm,
E(nm)=nmP(1)
E values (expectation), when they are << 1 are essentially the same as P values (ie., they estimate a small probability that the result could be by random chance). The difference is that E values can be > 1. An E value of 10 means that not only do you expect to find such a result by random chance, you expect to find it 10 x in this particular computation.
Starting with Blast, E values are generally corrected for the size of the search key and the database. Prior knowledge is built into Blast both in the form of the Blossum matching matrix and in some compositional corrections to protein searches. However, it is still somewhat risky to interpret Blast E values to better than a factor of 100 to 1000.
2. Once you start allowing gaps, the probability of of chance matches becomes hard to calculate, but clearly can become greatly improved. Expectation of x bp match containing y mismatches between 2 DNA sequences of lengths m and n:
E=nm(3yx!)/(4x(x-y)!y!); compared to nm/4x
3. Nonrandom base composition gives a higher probability of chance matches. Randomization controls can be used to evaluate chance matching for any base (or residue) composition. These are called Monte-Carlo calculations.
4. Programs using sophisticated pattern recognition algorithms should come with documentation about how well they perform. A proper evaluation requires a test set of sequences known to be true positives or true negatives by some outside criterion, and not involved in any way in deriving the parameters used by the program. The sequences used to derive parameters used by the program are called a training set. Be mindful of overly optimistic performance evaluation based on how well the program performs on its training set. Be skeptical of any claim of better than 70% success in identifying sequence features.
5. Always use positive and negative controls consisting of sequences that you feel sure about being truly positive and negative for the kinds of features you are searching for. You can also do a full database search and notice the quality of matches to sequences that you feel sure shouldn't meaningfully match the query.
6. For databases of precomputed results, first look up their results for a range of items that you believe to be established results. Make note of whether the database attempts to discriminate between signal and noise, or just gives a list of results including noise.
7. Every time you enter sequence into a new program, verify the beginning, end, and length of the sequence as the program saw it. Format problems are notorious for garbling the sequence on the way in. There are almost always lengthlimits, and programs often just throw away the end of your sequence without warning. This is especially true about web pages. Upload boxes often have higher size limits than text boxes.
8. Compare multiple tools. Often a consensus among different programs
is more reliable than any single one of them.