The gene in question turned up in a BLAST search keyed by ras. It was picked to illustrate how a gene from two different species, but of identical sequence, is collapsed into one entry in NCBI's nr database:
Notice from this nr header (3/2005) that the SwissProt annotators think the human and dog orthologs are identical, but NCBI's RefSeq annotators think otherwise:
>gi|54792729|ref|NP_001003275.1| rab4b GTP-binding protein [Canis familiaris]
gi|28422140|gb|AAH46927.1| RAB4B protein [Homo sapiens]
gi|20379046|gb|AAM21083.1| small GTP binding protein RAB4B [Homo sapiens]
gi|919|emb|CAA39800.1| rab4b [Canis familiaris]
gi|46577635|sp|P61018|RB4B_HUMAN Ras-related protein Rab-4B
gi|46577634|sp|P61017|RB4B_CANFA Ras-related protein Rab-4B
gi|5640004|gb|AAD45923.1| ras-related GTP-binding protein 4b [Homo sapiens]
gi|108108|pir||F36364 GTP-binding protein rab4b - dog
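To see at a glance which databases contributed to the one merged nr entry, the header can be pulled apart with a few lines of Python. This is only a sketch: it assumes the old-style gi|number|database|accession| layout shown above, and the name parse_entry is made up for illustration.

```python
from collections import Counter

# Entries copied from the merged nr header above (wrapped lines rejoined).
NR_HEADER = [
    ">gi|54792729|ref|NP_001003275.1| rab4b GTP-binding protein [Canis familiaris]",
    "gi|28422140|gb|AAH46927.1| RAB4B protein [Homo sapiens]",
    "gi|20379046|gb|AAM21083.1| small GTP binding protein RAB4B [Homo sapiens]",
    "gi|919|emb|CAA39800.1| rab4b [Canis familiaris]",
    "gi|46577635|sp|P61018|RB4B_HUMAN Ras-related protein Rab-4B",
    "gi|46577634|sp|P61017|RB4B_CANFA Ras-related protein Rab-4B",
    "gi|5640004|gb|AAD45923.1| ras-related GTP-binding protein 4b [Homo sapiens]",
    "gi|108108|pir||F36364 GTP-binding protein rab4b - dog",
]


def parse_entry(entry):
    """Split one 'gi|number|db|accession|rest' entry into a small dict."""
    _, gi, db, accession, rest = entry.lstrip(">").split("|", 4)
    return {"gi": int(gi), "db": db, "accession": accession, "rest": rest.strip()}


entries = [parse_entry(e) for e in NR_HEADER]
db_counts = Counter(e["db"] for e in entries)

for e in entries:
    print(e["db"], e["gi"], e["rest"])
print(db_counts)
```

Note that the pir entry has an empty accession field (the identifier follows the last bar), so anything that relies on the accession column needs a special case for it.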
The RefSeq entry for human rab4b would appear to specify a different N terminus:
>gi|21361509|ref|NP_057238.2| ras-related GTP-binding protein 4b [Homo sapiens]
gi|10441901|gb|AAG17228.1| unknown [Homo sapiens]
Length = 248
Score = 432 bits (1112), Expect = e-120
Identities = 208/208 (100%), Positives = 208/208 (100%)

Query: 6   DFLFKFLVIGSAGTGKSCLLHQFIENKFKQDSNHTIGVEFGSRVVNVGGKTVKLQIWDTA 65
           DFLFKFLVIGSAGTGKSCLLHQFIENKFKQDSNHTIGVEFGSRVVNVGGKTVKLQIWDTA
Sbjct: 41  DFLFKFLVIGSAGTGKSCLLHQFIENKFKQDSNHTIGVEFGSRVVNVGGKTVKLQIWDTA 100

Query: 66  GQERFRSVTRSYYRGAAGALLVYDITSRETYNSLAAWLTDARTLASPNIVVILCGNKKDL 125
           GQERFRSVTRSYYRGAAGALLVYDITSRETYNSLAAWLTDARTLASPNIVVILCGNKKDL
Sbjct: 101 GQERFRSVTRSYYRGAAGALLVYDITSRETYNSLAAWLTDARTLASPNIVVILCGNKKDL 160

Query: 126 DPEREVTFLEASRFAQENELMFLETSALTGENVEEAFLKCARTILNKIDSGELDPERMGS 185
           DPEREVTFLEASRFAQENELMFLETSALTGENVEEAFLKCARTILNKIDSGELDPERMGS
Sbjct: 161 DPEREVTFLEASRFAQENELMFLETSALTGENVEEAFLKCARTILNKIDSGELDPERMGS 220

Query: 186 GIQYGDASLRQLRQPRSAQAVAPQPCGC 213
           GIQYGDASLRQLRQPRSAQAVAPQPCGC
Sbjct: 221 GIQYGDASLRQLRQPRSAQAVAPQPCGC 248
The sequence of 21361509:
  1 msvslpltvm vrerdwigih lfslylslpv gipdfgsiws dflfkflvig sagtgkscll
 61 hqfienkfkq dsnhtigvef gsrvvnvggk tvklqiwdta gqerfrsvtr syyrgaagal
121 lvyditsret ynslaawltd artlaspniv vilcgnkkdl dperevtfle asrfaqenel
181 mfletsaltg enveeaflkc artilnkids geldpermgs giqygdaslr qlrqprsaqa
241 vapqpcgc
As opposed to the N terminus in the dog and SwissProt human versions:
MAETYDFLFKFLVIGSAGTGKSCLLHQFIENKFKQDSNHTIGVEFGSRVVNVGGKTVKLQ
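A quick way to pin down exactly where the two human versions diverge is to measure how much sequence they share counting back from the C terminus. The sketch below uses the two /translation strings from the RefSeq records reproduced at the end of these notes; the helper name common_suffix_length is just for illustration.

```python
# Protein sequences copied from the /translation qualifiers of NM_016154.2
# and NM_016154.1 (see the GenBank records at the end of these notes).
NP_057238_2 = (
    "MSVSLPLTVMVRERDWIGIHLFSLYLSLPVGIPDFGSIWSDFLF"
    "KFLVIGSAGTGKSCLLHQFIENKFKQDSNHTIGVEFGSRVVNVGGKTVKLQIWDTAGQ"
    "ERFRSVTRSYYRGAAGALLVYDITSRETYNSLAAWLTDARTLASPNIVVILCGNKKDL"
    "DPEREVTFLEASRFAQENELMFLETSALTGENVEEAFLKCARTILNKIDSGELDPERM"
    "GSGIQYGDASLRQLRQPRSAQAVAPQPCGC"
)
NP_057238_1 = (
    "MAETYDFLFKFLVIGSAGTGKSCLLHQFIENKFKQDSNHTIGVE"
    "FGSRVVNVGGKTVKLQIWDTAGQERFRSVTRSYYRGAAGALLVYDITSRETYNSLAAW"
    "LTDARTLASPNIVVILCGNKKDLDPEREVTFLEASRFAQENELMFLETSALTGENVEE"
    "AFLKCARTILNKIDSGELDPERMGSGIQYGDASLRQLRQPRSAQAVAPQPCGC"
)


def common_suffix_length(a, b):
    """Number of identical residues shared by a and b, counted from the C terminus."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


shared = common_suffix_length(NP_057238_2, NP_057238_1)
print("shared C-terminal residues:", shared)
print("unique N terminus (.2):", NP_057238_2[:len(NP_057238_2) - shared])
print("unique N terminus (.1):", NP_057238_1[:len(NP_057238_1) - shared])
```

The shared length should agree with the 208 identical residues BLAST reported above, leaving only the short MAETY stretch and the longer MSVSL stretch as the disputed N termini.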
The human MSVSL variant is annotated to come from NM_016154.2:857..1603.
Note that NM_016154.2 is a second version. Following back to NM_016154.1 reveals an entry submitted by an actual author (although the paper is unpublished), with a different splicing pattern and the MAETY N terminus.
Blastn against the chromosome division, with entries limited to "homo sapiens"[orgn], did not give a contiguous match, and repeated sequences had to be masked to keep the search from blowing up.
NG_000008 was the genomic segment cited by the gene entry. Searching the Features in NG_000008 revealed this position for rab4b:
     CDS             join(3101..3178,3464..3578,6857..6919,7000..7154,
                     9744..9839,9927..10057,19649..19652)
                     /gene="RAB4B"
However, examining these coordinates reveals that the annotator got them wrong (the amino acid sequence doesn't match).
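Checking a join() like this by hand is tedious, so here is a minimal sketch of how the coordinates could be parsed and the exons spliced out of a genomic sequence string. It assumes a simple forward-strand join with no complement() or remote accessions; parse_join and extract_cds are illustrative names, not part of any NCBI tool.

```python
import re

# CDS location copied from the NG_000008 feature table above.
LOCATION = ("join(3101..3178,3464..3578,6857..6919,7000..7154,"
            "9744..9839,9927..10057,19649..19652)")


def parse_join(location):
    """Return (start, end) pairs, 1-based inclusive, from a simple join(...)."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def extract_cds(genomic, segments):
    """Concatenate the listed segments of a forward-strand genomic string."""
    return "".join(genomic[start - 1:end] for start, end in segments)


segments = parse_join(LOCATION)
cds_length = sum(end - start + 1 for start, end in segments)
print(len(segments), "segments,", cds_length, "bp,",
      "multiple of 3:", cds_length % 3 == 0)
```

A length that divides by three is only a first sanity check. Deciding whether the annotation is right still takes splicing the segments out of NG_000008 with extract_cds, translating the result, and comparing it with the protein entry.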
Searching NM_016154.2 versus NC_000019 (complete human chromosome 19) blows up unless repeats are filtered.
Generally speaking, you save yourself a lot of trouble if you run RepeatMasker on any mammalian nucleic acid sequence before attempting BLAST searches.
RepeatMasker reveals that 1-800 in NM_016154.2 is an island of repetitive sequence.
Repeat Annotations:
   SW  perc perc perc  query     position in query       matching  repeat          position in repeat
score  div. del. ins.  sequence  begin   end  (left)     repeat    class/family    begin    end  (left)  ID
  302  28.9  8.5  2.6  test          6   158  (1849)  C  L2        LINE/L2         (378)   2936   2775    1
  784  22.7  5.0  4.6  test        161   375  (1632)  +  MIR       SINE/MIR           10    226    (36)   2
  440  10.6  0.0  8.6  test        376   420  (1587)  +  L1PA11    LINE/L1          6086   6127    (47)   3
 2423   7.1  0.3  0.3  test        421   730  (1277)  +  AluSp     SINE/Alu            1    310     (3)   4
  440  10.6  0.0  8.6  test        731   777  (1230)  +  L1PA11    LINE/L1          6128   6170     (4)   3
  784  22.7  5.0  4.6  test        778   801  (1206)  +  MIR       SINE/MIR          227    250    (12)   2
  189   0.0  0.0  0.0  test       1987  2007     (0)  +  (A)n      Simple_repeat       1     21     (0)   5
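The individual hits in the table run together into one nearly continuous block, which is easier to see after merging the intervals. The sketch below merges them and then hard-masks the merged regions with N before a search; the function names and the max_gap parameter are my own illustration, not RepeatMasker options.

```python
# (begin, end) in query coordinates, copied from the RepeatMasker table above.
REPEATS = [(6, 158), (161, 375), (376, 420), (421, 730),
           (731, 777), (778, 801), (1987, 2007)]


def merge_intervals(intervals, max_gap=0):
    """Merge 1-based inclusive intervals separated by at most max_gap bases."""
    merged = []
    for begin, end in sorted(intervals):
        if merged and begin - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([begin, end])
    return [tuple(m) for m in merged]


def hard_mask(seq, intervals, mask_char="N"):
    """Replace the bases inside each interval with mask_char."""
    chars = list(seq)
    for begin, end in intervals:
        end = min(end, len(chars))  # do not run past the end of the sequence
        for i in range(begin - 1, end):
            chars[i] = mask_char
    return "".join(chars)


print(merge_intervals(REPEATS))             # abutting hits joined
print(merge_intervals(REPEATS, max_gap=2))  # also bridge the 2-base gap at 159..160
print(hard_mask("ACGTACGT", [(2, 4)]))
```

With the two-base gap bridged, everything from 6 to 801 comes out as one island, which is the "1-800" region referred to above.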
Searching NM_016154.2 versus just the first 20000 bp of NG_000008
(to cut down the number of repetitive sequence hits) gives approximately
correct locations for the exons that are actually translated to make
the RefSeq protein entry; the matches are listed below.
The initiator was at 857, so no repetitive sequence was involved
in the translation.
6..702 matched 2126..2822
731..1058 matched 2851..3178
1059..1175 matched 3464..3580
1173..1236 matched 6856..6919
1234..1391 matched 6997..7154
1391..1492 matched 9743..9844
1485..1618 matched 9924..10057
1616..1970 matched 19646..20000
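The match list doubles as a lookup table for converting an mRNA position into an NG_000008 position. Here is a small sketch using only the first three matches, which is all that is needed for the initiator at 857; mrna_to_genomic is an illustrative name, and the repeat boundary of 801 is taken from the RepeatMasker table above.

```python
# (mRNA begin, mRNA end, genomic begin) for the first three matches listed above.
SEGMENTS = [(6, 702, 2126), (731, 1058, 2851), (1059, 1175, 3464)]
REPEAT_ISLAND_END = 801  # last repetitive base in the 5' island per RepeatMasker


def mrna_to_genomic(pos, segments):
    """Map a 1-based mRNA position to genomic coordinates, or None if unmatched."""
    for m_begin, m_end, g_begin in segments:
        if m_begin <= pos <= m_end:
            return g_begin + (pos - m_begin)
    return None


initiator = 857
print("initiator maps to", mrna_to_genomic(initiator, SEGMENTS))
print("downstream of the repeat island:", initiator > REPEAT_ISLAND_END)
```

Positions that fall between matches (for example 710) return None, which is a reminder that the BLAST matches are approximate and do not cover every base of the message.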
Blast 2 sequences suggests that the 5' ends of the NM_016154.1 and
NM_016154.2 messages do not overlap.
Where are the first 138 bases of NM_016154.1 in NG_000008?

Query: 20   agtaggaaggagccgttgctgtagccggagtggagcggctgccagccgaggagcaggcgc 79
            |||||||||||||||  |||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1351 agtaggaaggagccggggctgtagccggagtggagcggctgccagccgaggagcaggcgc 1410

Query: 80   ggccgcggcgccatattgcggccctcagcggccgcgaccgagtcatggctgagacctac 138
            |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1411 ggccgcggcgccatattgcggccctcagcggccgcgaccgagtcatggctgagacctac 1469
The initial 20 bases of NM_016154.1 were full of ambiguity codes, so the lack of match there is not meaningful.
NM_016154.1 is not repetitive according to RepeatMasker (except for poly(A)).
Searching NM_016154.1 vs human ESTs reveals that there are human ESTs with this structure.
What about the NM_016154.2 arrangement?
A search of the human EST database with NM_016154.2 (with repetitive sequences [1..800] removed) shows that no ESTs have this arrangement. There are, however, mRNA entries in the non-EST division.
Conclusion: EST data do not support the splicing pattern given in RefSeq. Larger RNAs of various structure reported in GenBank may be unspliced nuclear RNA or rare splicing errors from the next upstream gene.
Here's the region at the UCSC genome browser:
Here's the region at Ensembl.
Several of the ESTs are labeled as due to a 5' cloning strategy and
go back to 1308 on NG_000008. Let's take that as an approximate mRNA
start point.
The sequence is:
1 ccggaggggg gcggaggcgg aagtggcggt gccgggcccg gggagtagga aggagccggg
Recording the sequence will assist finding exactly this place in versions
of the chromosome sequence that are numbered differently.
For example, searching this string with NCBI's Blast 2 sequences against
NC_000019 reveals that this proposed start site is at 45975974 in the current
version of the complete human chromosome 19 sequence, and oriented in the
positive direction.
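When the target sequence is already in hand as a string, an exact-match search on both strands does the same job as the Blast 2 sequences lookup. The sketch below is a stand-in under that assumption: it finds only perfect matches, whereas BLAST tolerates mismatches, and the toy target is invented for the demonstration.

```python
# Landmark copied from the recorded start-site sequence above (blocks rejoined).
LANDMARK = "".join(
    "ccggaggggg gcggaggcgg aagtggcggt gccgggcccg gggagtagga aggagccggg".split()
)
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def locate(landmark, target):
    """Return (1-based start, strand) for every exact match on either strand."""
    landmark, target = landmark.lower(), target.lower()
    hits = []
    for strand, probe in (("+", landmark), ("-", reverse_complement(landmark))):
        start = target.find(probe)
        while start != -1:
            hits.append((start + 1, strand))
            start = target.find(probe, start + 1)
    return sorted(hits)


# Toy target (hypothetical): three filler bases on either side of the landmark.
toy = "ttt" + LANDMARK + "aaa"
print(locate(LANDMARK, toy))
print(locate(LANDMARK, reverse_complement(toy)))
```

The strand reported for each hit is the orientation information you need when the new numbering runs in the opposite direction.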
Reviewing the UCSC genome browser reveals an estimate of the end of
the 3' UTR of the upstream gene as:
TGAgctcagcctaccgctggccctgccgtttcccctccttggctttatgcaaatacaatcagcccagtgcaaa
ending at 45975235.
So there are only about 740 bp to consider for the transcriptional controls.
Use Entrez to retrieve NC_000019, limited to 45975235..45975974:
  1 acggctcgtc tccgtggtct ttggggtggg gtagggtagg gtggggactg tacaaatgaa
 61 atgtttctct aggttgctga atctaaccaa ttaacccgct gcctgtggta acgtcagtgg
121 ttgctaggca gagtttcact gatgaaagcc ctgtgcagta ggagcgctcc taagcttagg
181 tttcggacac aagcaaagga aaacctaagc agcccaacta ggggattgta gtgtcctctc
241 tagaccagtg ggagggagcc aatcggacta cgcggaggat ataaatcgca cagtaaaggt
301 tttcttggaa gattatctgg aaagggaggt gggaagtaac ctgcgcctat ctgcccagtt
361 ttcctccttt tcgcctttga gaacagtaat cgctcccgcc agctcaagat cagctcctct
421 tccagcctct ttctgtcaat cctgcccgtg cccccttctt tgcatttgta tcccctcccc
481 caagtccgct ccgatccaat ccggagactc gactctgccc cccgtactcc agactaaatc
541 cgttcctcgg ctcagctaag cagctctgca cgtcccttcc cacttgctca gccaatcgct
601 gggctcggcc ccgccccttt gggcttggtc cttctcacgc caccaatccg cgcctcttgt
661 cccgccccct gcccgcgagt ccgccaatcc cgcccttcgg aaagcgtcgc ctggtatcca
721 gctgccggag cgggtcgcgc
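The same kind of range-limited retrieval can be scripted through NCBI's E-utilities instead of the Entrez web page. The sketch below only builds the efetch request URL and does not contact the server; using E-utilities at all is my assumption, since the retrieval above was done interactively, and efetch_region_url is an illustrative name.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_region_url(accession, start, stop, strand=1):
    """Build an efetch URL for a FASTA subsequence (1-based, inclusive)."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
        "strand": strand,  # 1 = plus strand, 2 = minus strand
    }
    return EFETCH + "?" + urlencode(params)


url = efetch_region_url("NC_000019", 45975235, 45975974)
region_length = 45975974 - 45975235 + 1
print(url)
print(region_length, "bp requested")
```

An unversioned accession returns whatever version of the chromosome is current, so coordinates from these 3/2005 notes should not be expected to line up today. That is exactly why the landmark sequence was recorded earlier.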
Note: the AATGAAA at 45 in this segment is the only candidate I see
for the poly(A) addition signal for gene MIA, narrowing the region down
further.
http://www.fruitfly.org/seq_tools/promoter.html gave the underlined
sequence as a predicted eukaryotic promoter.
http://genes.mit.edu/McPromoter.html didn't predict a promoter.
Turning on the Gene expression and CpG island tracks in the UCSC genome browser indicates other predictors of a promoter in this area.
The distribution of ATG is marked in blue above. The distribution of CG is marked in red. Since human messages usually start on the first AUG, the absence of ATG in the latter half of this sequence supports the idea of a longer 5' untranslated region. On the other hand, the CpG island in the bottom half of the sequence would suggest the possibility of a GC-rich housekeeping promoter in the bottom half. The relative inconsistencies among EST starts may indicate that there are multiple start sites and perhaps multiple promoters for this region.
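The ATG and CG distributions can also be tabulated rather than read off the colored display. The sketch below lists motif positions and computes the usual observed/expected CpG ratio, (number of CG x length) / (number of C x number of G); the function names are illustrative, and only the first 60 and last 80 bases of the 740 bp segment are included to keep it short.

```python
def find_all(seq, motif):
    """1-based start positions of every occurrence of motif in seq."""
    seq, motif = seq.lower(), motif.lower()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = seq.find(motif, start + 1)
    return positions


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio; returns 0.0 if the sequence lacks C or G."""
    seq = seq.lower()
    c, g = seq.count("c"), seq.count("g")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("cg") * len(seq) / (c * g)


# Bases 1..60 and 661..740 of the segment retrieved above (blocks rejoined).
HEAD = "".join("acggctcgtc tccgtggtct ttggggtggg gtagggtagg "
               "gtggggactg tacaaatgaa".split())
TAIL = "".join("cccgccccct gcccgcgagt ccgccaatcc cgcccttcgg aaagcgtcgc "
               "ctggtatcca gctgccggag cgggtcgcgc".split())

for name, seq in (("head", HEAD), ("tail", TAIL)):
    print(name, "ATG at", find_all(seq, "atg"),
          "CpG obs/exp = %.2f" % cpg_obs_exp(seq))
```

CpG island calls are normally made over windows of 200 bp or more with an obs/exp ratio above about 0.6 and GC content above 50%, so ratios from stretches this short are only suggestive.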
If you click on the conservation track at UCSC, you will see the alignments of several mammalian genes through this area. Unfortunately, it is not numbered in a very helpful way. However, with a practiced eye you can find the various landmarks above. The TATAAAT, presumably the core of the predicted promoter as underlined, does not appear to be conserved, although the CCAAT box to the left of it may be. There is considerable gapping until one gets into the CG island. Thereafter there is little gapping, high conservation, and CG's are more often conserved across the alignment. This illustrates that the statistical properties of guessing a relevant site on a single sequence are poor, whereas guidance from cross species conservation can narrow down the possibilities and give you a better statistical chance of finding a real site.
The only conserved transcription factor binding site marked at UCSC is a YY1 site at 45976082 (CGGCGCCATATTGCGGC). YY1 sites are usually around the start of transcription. There are indeed a number of ESTs starting around that site, so we should probably revise our major start site to the right and regard the slightly longer ESTs as upstream starts.
There is an excellent tutorial on the use of the TRANSFAC database and other resources for exploring promoters, written by Enrique Blanco, Genome BioInformatics Research Lab, IMIM.
An example result posted there illustrates the high density of false positives you can expect without some way of narrowing down the options. Other methods of narrowing down the possibilities include biochemical footprinting and having some knowledge of which transcription factors are likely to be present in the relevant cell types.
3/4/2005 - Steve Hardies
NM_016154.2
LOCUS       NM_016154               1208 bp    mRNA    linear   PRI 18-DEC-2004
DEFINITION  Homo sapiens RAB4B, member RAS oncogene family (RAB4B), mRNA.
ACCESSION   NM_016154 REGION: 800..2007
VERSION     NM_016154.2  GI:21361508
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1208)
  AUTHORS   Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
            Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
            Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
            Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
            Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
            Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
            Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
            Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
            Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
            McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
            Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
            Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
            Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
            Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
            Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
            Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
            Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
            Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
  TITLE     Generation and initial analysis of more than 15,000 full-length
            human and mouse cDNA sequences
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
   PUBMED   12477932
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AF217985.1.
            On Jun 10, 2002 this sequence version replaced gi:7706672.
FEATURES             Location/Qualifiers
     source          1..1208
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="19"
                     /map="19q13.2"
     gene            <1..1208
                     /gene="RAB4B"
                     /db_xref="GeneID:53916"
     CDS             58..804
                     /gene="RAB4B"
                     /note="small GTP binding protein RAB4B"
                     /codon_start=1
                     /product="ras-related GTP-binding protein 4b"
                     /protein_id="NP_057238.2"
                     /db_xref="GI:21361509"
                     /db_xref="GeneID:53916"
                     /translation="MSVSLPLTVMVRERDWIGIHLFSLYLSLPVGIPDFGSIWSDFLF
                     KFLVIGSAGTGKSCLLHQFIENKFKQDSNHTIGVEFGSRVVNVGGKTVKLQIWDTAGQ
                     ERFRSVTRSYYRGAAGALLVYDITSRETYNSLAAWLTDARTLASPNIVVILCGNKKDL
                     DPEREVTFLEASRFAQENELMFLETSALTGENVEEAFLKCARTILNKIDSGELDPERM
                     GSGIQYGDASLRQLRQPRSAQAVAPQPCGC"
ORIGIN
        1 tggcaagtac tacataagag ttatctgtta ctgttaccta ttggtttgtt tctgcctatg
       61 tctgtttctc tccctctaac tgtgatggtt agagagagag attggattgg aatccatctt
      121 ttttccctgt atctttctct ccctgtgggt atccctgatt ttggctccat ctggtcagac
      181 ttcctcttca aattcctggt gattggcagt gcaggaactg gcaaatcatg tctccttcat
      241 cagttcattg agaataagtt caaacaggac tccaaccaca caatcggcgt ggagtttgga
      301 tcccgggtgg tcaacgtggg tgggaagact gtgaagctac agatttggga cacggctggc
      361 caggagcggt ttcggtcagt gacgcggagt tattaccgag gggcggctgg agccctgctg
      421 gtgtacgaca tcaccagccg ggagacatac aactcactgg ctgcctggct gacggatgcc
      481 cgcaccctgg ccagccccaa catcgtggtc atcctctgtg gcaacaagaa ggacctggac
      541 cctgagcggg aggtcacttt cctggaggcc tcccgctttg cccaggagaa tgagctgatg
      601 ttcctggaga ccagcgctct cacaggcgag aacgtggagg aggcgttcct caagtgtgcc
      661 cgcactatcc tcaacaagat tgactcaggc gagctagacc cggagaggat gggctctggc
      721 attcagtacg gggatgcgtc cctccgccag cttcggcagc ctcggagtgc ccaggccgtg
      781 gcccctcagc cgtgtggctg ctgagctctg tggagccagc tcacctgttc tccaggacca
      841 gccctgctgg ggcccaggcc caggctctga gaggccgtgt cctaacctgc cctggccccg
      901 gagaagctac gttgccacct gtcccccttc cctggcctgg tggggcctgg ctttggggca
      961 agactgagcc acgggggaag ggggaatccc gtacctgctg ctgcttcctc tgtcttggct
     1021 aacgtctgtc cccctgaacc cctaaccata tcccaagagc tcccaaagcc tgagaccagg
     1081 gtcatttgtc cccaactccc catctggccc tgctgttgct agtacctgtt atttattacc
     1141 tggaggcctg tccagcaccc accctacccc cataaagcat tgtttacaaa aaaaaaaaaa
     1201 aaaaaaaa
//
NM_016154.1
The record [gi:7706672] has been replaced by NM_016154.2.
LOCUS       NM_016154               1168 bp    mRNA    linear   PRI 24-AUG-2001
DEFINITION  Homo sapiens RAB4B, member RAS oncogene family (RAB4B), mRNA.
ACCESSION   NM_016154
VERSION     NM_016154.1  GI:7706672
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1168)
  AUTHORS   Huang,C., Wu,T., Xu,S., Gu,W., Wang,Y., Han,Z. and Chen,Z.
  TITLE     Novel genes expressed in hematopoietic stem/progenitor cells from
            myelodysplastic syndromes patient
  JOURNAL   Unpublished
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AF165522.1.
            [WARNING] On Jun 10, 2002 this sequence was replaced by a newer
            version gi:21361508.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..1168
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="19"
                     /map="19q13.2"
                     /cell_type="hematopoietic stem/progenitor cell"
     gene            1..1168
                     /gene="RAB4B"
                     /db_xref="GeneID:53916"
     CDS             124..765
                     /gene="RAB4B"
                     /codon_start=1
                     /product="ras-related GTP-binding protein 4b"
                     /protein_id="NP_057238.1"
                     /db_xref="GI:7706673"
                     /db_xref="GeneID:53916"
                     /translation="MAETYDFLFKFLVIGSAGTGKSCLLHQFIENKFKQDSNHTIGVE
                     FGSRVVNVGGKTVKLQIWDTAGQERFRSVTRSYYRGAAGALLVYDITSRETYNSLAAW
                     LTDARTLASPNIVVILCGNKKDLDPEREVTFLEASRFAQENELMFLETSALTGENVEE
                     AFLKCARTILNKIDSGELDPERMGSGIQYGDASLRQLRQPRSAQAVAPQPCGC"
     misc_feature    148..630
                     /gene="RAB4B"
                     /note="arf; Region: ADP-ribosylation factor family"
                     /db_xref="CDD:pfam00025"
     misc_feature    148..630
                     /gene="RAB4B"
                     /note="RAB; Region: Rab subfamily of small GTPases"
                     /db_xref="CDD:smart00175"
     misc_feature    148..609
                     /gene="RAB4B"
                     /note="RAS; Region: Ras subfamily of RAS small GTPases"
                     /db_xref="CDD:smart00173"
     misc_feature    151..663
                     /gene="RAB4B"
                     /note="ras; Region: Ras family"
                     /db_xref="CDD:pfam00071"
     misc_feature    151..498
                     /gene="RAB4B"
                     /note="ARF; Region: ARF-like small GTPases"
                     /db_xref="CDD:smart00177"
     misc_feature    151..495
                     /gene="RAB4B"
                     /note="SAR; Region: Sar1p-like members of the Ras-family
                     of small GTPases"
                     /db_xref="CDD:smart00178"
     misc_feature    160..639
                     /gene="RAB4B"
                     /note="RHO; Region: Rho (Ras homology) subfamily of
                     Ras-like small GTPases"
                     /db_xref="CDD:smart00174"
     misc_feature    163..633
                     /gene="RAB4B"
                     /note="RAN; Region: Ran (Ras-related nuclear proteins)
                     /TC4 subfamily of small GTPases"
                     /db_xref="CDD:smart00176"
     misc_feature    286..669
                     /gene="RAB4B"
                     /note="GTP_EFTU; Region: Elongation factor Tu family"
                     /db_xref="CDD:pfam00009"
ORIGIN
        1 astncsngth tntynchcka gtaggaagga gccgttgctg tagccggagt ggagcggctg
       61 ccagccgagg agcaggcgcg gccgcggcgc catattgcgg ccctcagcgg ccgcgaccga
      121 gtcatggctg agacctacga cttcctcttc aaattcctgg tgattggcag tgcaggaact
      181 ggcaaatcat gtctccttca tcagttcatt gagaataagt tcaaacagga ctccaaccac
      241 acaatcggcg tggagtttgg atcccgggtg gtcaacgtgg gtgggaagac tgtgaagcta
      301 cagatttggg acacggctgg ccaggagcgg tttcggtcag tgacgcggag ttattaccga
      361 ggggcggctg gagccctgct ggtgtacgac atcaccagcc gggagacata caactcactg
      421 gctgcctggc tgacggatgc ccgcaccctg gccagcccca acatcgtggt catcctctgt
      481 ggcaacaaga aggacctgga ccctgagcgg gaggtcactt tcctggaggc ctcccgcttt
      541 gcccaggaga atgagctgat gttcctggag accagcgctc tcacaggcga gaacgtggag
      601 gaggcgttcc tcaagtgtgc ccgcactatc ctcaacaaga ttgactcagg cgagctagac
      661 ccggagagga tgggctctgg cattcagtac ggggatgcgt ccctccgcca gcttcggcag
      721 cctcggagtg cccaggccgt ggcccctcag ccgtgtggct gctgagctct gtggagccag
      781 ctcacctgtt ctccaggacc agccctgctg gggcccaggc ccaggctctg agaggccgtg
      841 tcctaacctg ccctggcccc ggagaagcta cgttgccacc tgtccccctt ccctggcctg
      901 gtggggcctg gctttggggc aagactgagc cacgggggaa gggggaatcc cgtacctgct
      961 gctgcttcct ctgtcttggc taacgtctgt ccccctgaac ccctaaccat atcccaagag
     1021 ctcccaaagc ctgagaccag ggtcatttgt ccccaactcc ccatctggcc ctgctgttgc
     1081 tagtacctgt tatttattac ctggaggcct gtccagcacc caccctaccc ccataaagca
     1141 ttgtttacaa aaaaaaaaaa aaaaaaaa
//