An example of exploring resources for protein
analysis from a structural standpoint.
rab4b - a ras-like protein highly conserved in mammals.
Typing NP_001003275 into the Entrez Protein query box at NCBI
produced this GenPept
report.
-
Following the Blink link, you find precomputed Blast results.
-
Note: Q91ZR1 is a SwissProt entry. (I can tell from the compact ID
starting with a letter). SwissProt entries are more likely to have
useful documentation. Clicking on the number to the left of Q91ZR1
brings up a Blast alignment of the two sequences (in this case, the sequence versus itself).
Notice the annotation of features in the alignment.
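The ID check described above can be sketched in code. This is a minimal sketch, assuming the UniProt accession pattern as given in the UniProt help pages and a simplified pattern for RefSeq protein accessions; the function name is made up for illustration. Note that the UniProt pattern covers both SwissProt and TrEMBL accessions, so it cannot by itself confirm that an entry is curated.

```python
import re

# UniProt accession pattern (6 or 10 characters), as published in the UniProt help pages.
UNIPROT_ACC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# Simplified RefSeq protein pattern (an assumption, not the full RefSeq specification):
# a two-letter prefix, an underscore, digits, and an optional ".version" suffix.
REFSEQ_PROTEIN = re.compile(r"^(?:NP|XP|YP|AP|WP|ZP)_[0-9]+(?:\.[0-9]+)?$")


def classify_accession(acc):
    """Guess which database family an accession string belongs to."""
    acc = acc.strip().upper()
    if REFSEQ_PROTEIN.match(acc):
        return "RefSeq protein"
    if UNIPROT_ACC.match(acc):
        return "UniProt (SwissProt/TrEMBL)"
    return "unknown"


if __name__ == "__main__":
    for acc in ("NP_001003275", "Q91ZR1"):
        print(acc, "->", classify_accession(acc))
```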
-
There is also a CDD hit at the top (Conserved Domain Database) to a family
named cd00154.
-
In this case there are multiple related families indicated, derived from
different family databases by different curators based on different criteria.
-
There has probably been an explicit or implicit attempt to keep paralogues
in different families due to the implication that they have specialized
functions.
-
There may be useful Medline references given in each of the families.
-
There are actually two closely related structures (both from rat rab3A).
Note the PDB-like IDs in the alignment and the comment shown on mouse-over
of the sequence ID.
-
Upon following the structure link, the sequence viewer clarifies which
protein structure is actually displayed.
-
The architecture link goes to cDART, but cDART is better accessed through
the "show domain relatives" link on the initial CDD page.
-
The cDART display shows proteins with the target domain, related family
domains, and progressively more vaguely related family domains grouped
according to common domain structure of the proteins carrying them.
-
In this case, cDART shows that there is a collection of small rab-like
proteins that carry a small SOC domain in the C terminus. You might want
to explore whether any analogous function is hiding in the unannotated
C terminus of rab4b.
-
Note that db_xref links in the feature table itself also point to these
facilities.
-
The 3D Structure button on the Blink page gives a list of many related
3D structures.
-
One of the several COGs families listed on the CDD page is also directly
linked on the Blink page.
-
COGs was put together with the idea that paralogous families already distinct
in the common ancestor of procaryotes, eucaryotes, and archaea could be
defined. For eucaryotes, the COGs families tend to be limited to fungi.
-
Note that there is a tree color coded by taxonomic source on the COGs page.
Doing it yourself
-
Cut/paste the sequence from NP_001003275 into the NCBI PsiBlast search
page. Check the box requesting CDD search and ...
-
Note that the search of the CDD database brings up the same CDD record
as was listed on the Blink page.
-
Click the Format button, and await your
result.
-
Note that several of the resulting matches are tagged by buttons to link
to other information. The 3 entries with red "S" buttons link to
structures.
-
In this case the number of matching sequences is so large that the display limit
of 500 is cutting off many more matches with structural information.
To get more structures, repeat the search, but switch the database
from the default of "nr" to "pdb".
-
This result
has many more structure entries.
-
In many cases, the initial BlastP search will have found homologues, but
none with the sort of characterization you were hoping for. In that case
Psi-Blast can search deeper. We will simulate that situation by seeing
if more structures of a divergent nature can be found searching only within
pdb.
-
Click the "iteration 2" button in Psi-Blast. Go to the Blast search
page on your task bar and click "format" again. Now go back to the
page on the taskbar that had opened with the initial result.
-
Scroll down the new
result page to see the additional entries found marked "new".
-
This process can be repeated through further iterations. However,
the user must beware: a false positive accepted in one iteration is built
into the profile used for the next, so false positives may accumulate and
eventually take over the family.
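One way to keep an eye on what each iteration adds is to compare the hit lists yourself. The following is a minimal sketch, assuming you have copied the hit IDs from each iteration into plain lists; the function name and the IDs in the usage example are hypothetical, not results from the search above.

```python
def new_hits_per_iteration(iterations):
    """For each iteration, list the hits not seen in any earlier iteration.

    `iterations` is a list of hit-ID lists, one per Psi-Blast iteration,
    in the order the iterations were run.
    """
    seen = set()
    result = []
    for hits in iterations:
        fresh = []
        for hit in hits:
            if hit not in seen:
                fresh.append(hit)
                seen.add(hit)
        result.append(fresh)
    return result


if __name__ == "__main__":
    # Hypothetical hit lists for three iterations.
    runs = [
        ["hitA", "hitB"],
        ["hitA", "hitB", "hitC"],
        ["hitA", "hitB", "hitC"],
    ]
    for number, fresh in enumerate(new_hits_per_iteration(runs), start=1):
        print("iteration", number, "new:", fresh)
```

An iteration that adds nothing new suggests the search has converged. The entries that do show up as new are the ones worth inspecting by eye before letting them into the next round.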
-
The Blast page has many capabilities for limiting the scope of a search.
For example, we know that ras appears in viruses. We may want to
ask if anything closer to rab4b than ras appears in viruses.
-
The Blast Format page has a box for Entrez query language, and a drop down
box for common subdivisions that can be used to redisplay a search result
limited to just certain kinds of sequences. But in this case,
the initial search was so overwhelmed with finds that we should repeat
the search asking only to search viruses in the first place.
-
On the initial blast search page, where it says "limit by Entrez query",
click viruses[orgn] in the adjacent selection box.
-
Repeat the search.
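The same restricted search can in principle be set up as a URL rather than through the form. This is a minimal sketch that only builds the request string and does not send it. The base URL and the parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY) are my understanding of NCBI's Blast URL API and should be checked against the current NCBI documentation; the function name is made up.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI Blast URL API; verify before use.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_request(query, database="nr", entrez_query=None):
    """Build (but do not send) a blastp submission URL.

    `query` may be an accession or a raw sequence; `entrez_query`
    restricts the search, e.g. "viruses[orgn]".
    """
    params = {
        "CMD": "Put",
        "PROGRAM": "blastp",
        "DATABASE": database,
        "QUERY": query,
    }
    if entrez_query:
        params["ENTREZ_QUERY"] = entrez_query
    return BLAST_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_blast_request("NP_001003275", entrez_query="viruses[orgn]"))
```

Restricting the search up front, rather than filtering the display afterwards, matters here for the reason given above: the unrestricted result is cut off at the display limit before any viral matches would be reached.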
-
From the result,
you can see that all the matches in viruses are at a similar level of divergence
as classical v-ras.
Taking it to other databases:
-
PDB. NCBI draws on the pdb database, but its pages reflect
its internal reorganization of the PDB data.
-
Moving from Blast listings of pdb entries to the PDB site is simple, because NCBI uses the
same accession numbers as PDB in this context.
-
Just go to http://www.rcsb.org and put
1oiv in the search box. You will be presented with many more options
for viewing and retrieving the data in alternative formats. The new
version of the site will do a DSSP calculation (evaluate the secondary
structure according to an objective standard).
-
Similarly the structure cited as the structural reference for family cd00154
(3RAB_A) can be found at PDB by typing in the pdb number 3rab.
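Turning an NCBI-style structure ID such as 3RAB_A into the number to type at PDB is just a matter of dropping the chain suffix. Here is a minimal sketch, assuming the classic four-character PDB ID (a digit followed by three letters or digits) and an underscore before the chain; the function name is made up.

```python
import re

# Classic PDB ID: one digit plus three alphanumerics, optionally "_chain".
PDB_CHAIN = re.compile(r"^([0-9][A-Za-z0-9]{3})(?:_([A-Za-z0-9]+))?$")


def split_pdb_chain(identifier):
    """Split an ID like '3RAB_A' into (pdb_id, chain).

    The PDB ID is returned in lower case; chain is None when absent.
    """
    match = PDB_CHAIN.match(identifier.strip())
    if match is None:
        raise ValueError("not a recognizable PDB-style ID: %r" % identifier)
    return match.group(1).lower(), match.group(2)


if __name__ == "__main__":
    print(split_pdb_chain("3RAB_A"))
    print(split_pdb_chain("1oiv"))
```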
-
Pfam - Pfam is the premier protein database, which is found at a
number of mirror sites. Note that cd00154 cited pfam00071 and pfam00025
as related families.
-
Go to the Sanger Pfam site at http://www.sanger.ac.uk/Software/Pfam/
-
Put the mouse on the "Browse by" item on the upper toolbar, select Pfam
family ID, and type PF00071 in the query box to access pfam00071.
-
Note that this is a much more inclusive family than the small one found
in cd00154.
-
Some facilities to note are:
-
A neighbor joining tree can be retrieved.
-
Although the java applet view seems to be nonfunctional at this time, a
standard tree description file can be downloaded. That file, when
run through the Phylip drawgram program at the Bioinformatics center, produced
this
tree.
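If you only want to see which sequences are in the downloaded tree, the file can be read directly. This is a minimal sketch, assuming the tree description file is in Newick format (the format Phylip programs read) with unquoted labels; the function name and the toy tree in the usage example are made up.

```python
import re


def newick_leaves(newick):
    """Return the leaf labels of a Newick tree string, in file order.

    A leaf label directly follows "(" or ","; internal node labels
    follow ")" and branch lengths follow ":", so neither is matched.
    Quoted labels are not handled.
    """
    return re.findall(r"[(,]\s*([^\s():,;]+)", newick)


if __name__ == "__main__":
    toy_tree = "((rab4b:0.1,rab3a:0.2):0.05,(ras:0.3,ran:0.4):0.1);"
    print(newick_leaves(toy_tree))
```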
-
The Pandit database linked in the database section is devoted to showing
trees, and its tree viewing applet works.
-
The ras
alignment can be viewed as an HMM logo instead of a sequence alignment.
-
The logo shows the relative frequency of each residue at each alignment
position by the size of the letter.
-
The HMM logo also gives a measure of the acceptability of insertions and
deletions at each position.
-
If one realigns specific clades (e.g., rab vs. ran vs. ras) and constructs
HMM models with HMMER (or SAM), those models can be converted
to HMM logos at http://logos.molgen.mpg.de/. These then provide a
compact means of visual comparison of the differences between the families.
-
More information and resources for making logos can be found at http://www.lecb.ncifcrf.gov/~toms/
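To make the idea of letter size concrete, here is how letter heights are computed for one column of a classic sequence logo. This is a minimal sketch of the plain sequence-logo calculation (information content in bits, with no small-sample correction), not of the HMM logo method, which is computed from the model rather than from raw column counts. The function name is made up.

```python
import math
from collections import Counter


def column_logo_heights(column, alphabet_size=20):
    """Return {residue: letter height in bits} for one alignment column.

    Total column height = log2(alphabet_size) - Shannon entropy of the
    column; each letter gets a share proportional to its frequency.
    Gap characters are ignored.
    """
    residues = [c for c in column.upper() if c.isalpha()]
    if not residues:
        return {}
    total = len(residues)
    freqs = {aa: n / total for aa, n in Counter(residues).items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    information = math.log2(alphabet_size) - entropy
    return {aa: p * information for aa, p in freqs.items()}


if __name__ == "__main__":
    print(column_logo_heights("GGGG"))   # fully conserved column
    print(column_logo_heights("GGAA"))   # two residues, equally frequent
    print(column_logo_heights("GA-S"))   # gap is skipped
```

A fully conserved column gives one tall letter, while a column with many different residues gives a short stack, which is why conserved motifs stand out in a logo.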
-
The distribution of the family in relation to the tree of organisms is
given, and sequences can be viewed in alignment by taxonomic identity of
the organism containing them.
-
Structural references into PDB are given. The various structure viewers
and sites listed offer a variety of additional information.
-
PDBSUM gives you a Ramachandran Plot and analysis.
-
The domain architecture tool gives a readable accounting of the distribution
of this domain into different multidomain contexts.
-
Among the other database links:
-
FUNSHIFT attempts to detect differences between subfamilies in the evolutionary
constraints at individual residues.
-
HOMSTRAD shows information about structure alignment.
-
SYSTERS and SCOP deal with various conceptual levels of groupings into
superfamilies.
-
A problem working with a combination of NCBI and European databases is
that they do not use the same names for genes. The names listed in
Pfam arise from the SwissProt database, which is now revising itself to
be called UniProt.
-
The NP_001003275 sequence no doubt appears in UniProt, and probably in
Pfam alignments, but the name does not.
-
Often the simplest way to find the sequence in another database system
is to do a sequence search, letting the sequence be its own identifier.
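When you already have both sets of sequences on hand, exact matches can be found without any search at all by comparing the sequences themselves. This is a minimal sketch, assuming each set is a simple name-to-sequence dictionary; the function names and the toy records in the usage example are made up, and only identical sequences are matched (a true sequence search is still needed for near matches).

```python
import hashlib


def sequence_key(seq):
    """Digest of a sequence after removing whitespace, case, and a trailing '*'."""
    cleaned = "".join(seq.split()).upper().rstrip("*")
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


def match_by_sequence(records_a, records_b):
    """Map each name in records_a to the names in records_b with the same sequence."""
    index = {}
    for name, seq in records_b.items():
        index.setdefault(sequence_key(seq), []).append(name)
    return {
        name: index.get(sequence_key(seq), [])
        for name, seq in records_a.items()
    }


if __name__ == "__main__":
    # Toy records, not real database entries.
    ncbi_style = {"toy_ncbi_name": "MKT AYI\nAKQR"}
    uniprot_style = {"TOY_NAME_A": "mktayiakqr", "TOY_NAME_B": "MKTAYIAKQK"}
    print(match_by_sequence(ncbi_style, uniprot_style))
```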
-
The search
facility at PIR is helpful for switching between the two systems.
-
The SAM
server will take your sequence and return:
-
A well aligned expanded family derived from the current database and capable
of including highly diverged members.
-
A search against their family model database.
-
A SAM HMM model that could be used for further searching in the local SAM
implementation.
-
A variety of secondary structure predictions.
-
Sequence logos and secondary structure logos.
3/13/2005 - Steve Hardies