Conf: 335400013200036776511455555432003565010220044453144310011246
Pred: CCCCCCCCCCCEECCCCCCCHHHHHHHHHHHEEEEEECCEEEEEEEEEHHHHHHHHHHHH
  AA: AQFPVLGRTQAAYLAPGENLDDKRKDIKHTEKVITIDGLLTADVLIYDIEDAMNHYDVRS
              70        80        90       100       110       120
Conf: 778765343210124236777532102676656520015762267764154334425777
Pred: HHHHHHHHHHHCCCCCHHHHHHHHHCCCCCCCCCCCCCCCCCEEEEEECCCCCCCCHHHH
  AA: EYTSQLGESLAMAADGAVLAEIAGLCNVESKYNENIEGLGTATVIETTQNKAALTDQVAL
             130       140       150       160       170       180
Conf: 899999987777653114467655058758862678899753662110320066765200
Pred: HHHHHHHHHHHHHHHHCCCCCCCCCEEEECCCCHHHHHHHHCCCCCCHHHHCCCCCCCCC
  AA: GKEIIAALTKARAALTKNYVPAADRVFYCDPDSYSAILAALMPNAANYAALIDPEKGSIR
             190       200       210       220       230       240
Conf: 013302430333114788876546788611202468863368750021100000133211
Pred: CCCCCEEEECCCCCCCCCCCCCCCCCCCCEEECCCCCCCEEEEEECCCEEEEEECCCCCE
  AA: NVMGFEVVEVPHLTAGGAGTAREGTTGQKHVFPANKGEGNVKVAKDNVIGLFMHRSAVGT
             250       260       270       280       290       300
Conf: 441344345553125413555555420357887332425789729
Pred: EEEHHHHHHHHHHCCCCHHHHHHHHHCCCCCCCCCCCCEEEEEEC
  AA: VKLRDLALERARRANFQADQIIAKYAMGHGGLRPEAAGAVVFKVE
             310       320       330       340
SAPS. Version of April 11, 1996.
Date run: Fri Mar 29 20:49:50 2002
********************************************************************************
Protein 1 (File: wwwtmp/.SAPS.22819.5493.seq)
SWISS-PROT ANNOTATION:
ID T7
DE T7 capsid, 345 bases, 2ED3E451 checksum.
number of residues: 345; molecular weight: 36.5 kdal
1 MASMTGGQQM GTNQGKGVVA AGDKLALFLK VFGGEVLTAF ARTSVTTSRH MVRSISSGKS
61 AQFPVLGRTQ AAYLAPGENL DDKRKDIKHT EKVITIDGLL TADVLIYDIE DAMNHYDVRS
121 EYTSQLGESL AMAADGAVLA EIAGLCNVES KYNENIEGLG TATVIETTQN KAALTDQVAL
181 GKEIIAALTK ARAALTKNYV PAADRVFYCD PDSYSAILAA LMPNAANYAA LIDPEKGSIR
241 NVMGFEVVEV PHLTAGGAGT AREGTTGQKH VFPANKGEGN VKVAKDNVIG LFMHRSAVGT
301 VKLRDLALER ARRANFQADQ IIAKYAMGHG GLRPEAAGAV VFKVE
--------------------------------------------------------------------------------
COMPOSITIONAL ANALYSIS (extremes relative to: ECOLI.q)
The composition of the input sequence is evaluated relative to the residue
usage quantile table specified with the `-s species' flag. Low usage in
the 1% quantile is indicated by the label `--' (e.g., Y-- means that the
input sequence uses tyrosine as little as the 1% least tyrosine contain-
ing proteins in the reference set); low usage in the 5% quantile is indi-
cated by the label `-' (e.g., L-); high usage above the 95% quantile
point is indicated by the label `+' (e.g., A+); and high usage above the
99% quantile point is indicated by the label `++' (e.g., LIVFM++). The
usage is evaluated for all 20 amino acids, positive (KR) and negative (ED)
charge, total charge (KRED), net charge (KR-ED), major hydrophobics
(LVIFM), and the groupings ST, AGP (encoded by CCN, GCN, and GGN codons),
and FIKMNY (encoded by AAN, AUN, UAN, and UUN codons).
A+ : 51(14.8%); C : 2( 0.6%); D : 18( 5.2%); E : 20( 5.8%); F : 10( 2.9%)
G : 33( 9.6%); H : 7( 2.0%); I : 17( 4.9%); K : 21( 6.1%); L : 28( 8.1%)
M : 10( 2.9%); N : 15( 4.3%); P : 9( 2.6%); Q : 11( 3.2%); R : 16( 4.6%)
S : 15( 4.3%); T : 23( 6.7%); V : 29( 8.4%); W : 0( 0.0%); Y : 10( 2.9%)
KR : 37 ( 10.7%); ED : 38 ( 11.0%); AGP : 93 ( 27.0%);
KRED : 75 ( 21.7%); KR-ED : -1 ( -0.3%); FIKMNY : 83 ( 24.1%);
LVIFM : 94 ( 27.2%); ST : 38 ( 11.0%).
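The counts and group totals in this table can be recomputed directly from the sequence. The following is a minimal Python sketch, not the SAPS implementation: the names `GROUPS`, `composition`, and `percent` are illustrative assumptions, and the comparison against the reference quantile table (the `+`/`-` labels) is not attempted.

```python
from collections import Counter

# Residue groupings named in the SAPS description (hypothetical table name).
GROUPS = {
    "KR": "KR", "ED": "ED", "KRED": "KRED", "LVIFM": "LVIFM",
    "ST": "ST", "AGP": "AGP", "FIKMNY": "FIKMNY",
}

def composition(seq):
    """Return (per-residue counts, group counts, net charge KR-ED)."""
    counts = Counter(seq.upper())
    groups = {name: sum(counts[r] for r in members)
              for name, members in GROUPS.items()}
    net_charge = groups["KR"] - groups["ED"]
    return counts, groups, net_charge

def percent(count, length):
    """Percentage rounded to one decimal, as printed in the table."""
    return round(100.0 * count / length, 1)
```

For example, `percent(51, 345)` gives the 14.8% reported for alanine.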
--------------------------------------------------------------------------------
CHARGE DISTRIBUTIONAL ANALYSIS
The distribution of charges in the protein sequence is evaluated in terms
of clusters, high scoring segments, and runs and periodic patterns. Clus-
ters indicate regions of typically 30 to 60 residues exhibiting a rela-
tively high charge concentration. For high scoring charge segments, posi-
tive scores are assigned to charge residues of the appropriate type and
negative scores to all other residues. A significant cumulative positive
score again indicates a region of high charge concentration. The cluster
method and the scoring method will generally pick out the same segments
(with the scoring method often delimiting the segment to a narrower
range), conferring robustness to the results. Short segments of high
charge concentration are displayed as runs (with errors). Periodic pat-
terns focus on those with charges every second or third position, with
possible relevance to amphipathic secondary structures; other periodic
patterns are displayed in the general periodicity analysis section of the
output.
1 0000000000 00000+0000 00-+00000+ 0000-00000 0+000000+0 00+00000+0
61 0000000+00 0000000-00 --+++-0+00 -+0000-000 00-0000-0- -00000-0+0
121 -000000-00 0000-00000 -0000000-0 +00-00-000 00000-0000 +0000-0000
181 0+-000000+ 0+0000+000 000-+0000- 0-00000000 0000000000 00-0-+000+
241 00000-00-0 0000000000 0+-00000+0 00000+0-00 0+00+-0000 0000+00000
301 0+0+-000-+ 0++00000-0 000+000000 00+0-00000 00+0-
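The charge printout above is the sequence rewritten over the alphabet {+, -, 0}. A minimal Python sketch of that encoding follows; the function names `charge_string` and `blocks_of_ten` are assumptions made for illustration, not SAPS identifiers.

```python
def charge_string(seq):
    """Encode K/R as '+', E/D as '-', and every other residue as '0'."""
    table = {"K": "+", "R": "+", "E": "-", "D": "-"}
    return "".join(table.get(res, "0") for res in seq.upper())

def blocks_of_ten(text):
    """Group a string into space-separated blocks of ten characters."""
    return " ".join(text[i:i + 10] for i in range(0, len(text), 10))
```

Applied to the first 20 residues (MASMTGGQQM GTNQGKGVVA), this yields `0000000000 00000+0000`, matching the first two blocks of the printout.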
A. CHARGE CLUSTERS.
Positive, negative, and mixed charge clusters are distinguished. In each
case, cmin indicates the minimum number of charges required for a signifi-
cant charge cluster corresponding to the given window size; e.g., cmin =
9/30 or 12/45 or 15/60 means that significance requires at least 9 charges
in a segment of 30 (or fewer) residues, or 12 charges in a segment of
length 45, or 15 charges in a segment of length 60. In the case of posi-
tive and negative charge clusters, these counts refer to net charge, i.e.,
charges of the opposite sign within the window are counted as -1. The
sizes of the clusters are optimized for display to indicate the segment of
highest charge concentration, but a minimum size of 20 residues is
required. A mixed charge cluster that begins and ends within 15 residues
of the endpoints of a pure charge cluster is not displayed (since its sig-
nificance rests mostly on the charged residues comprising the displayed
pure charge cluster), unless the -v (verbose output) flag is set, in which
case both the pure and the mixed charge cluster are displayed. On the
other hand, pure charge clusters that are embedded in mixed charge clus-
ters are displayed separately (indicated by a * preceding the specifica-
tion of location).
For each cluster are given its location in the sequence (From, to),
the quartile of the location (1st, 2nd, 3rd, or 4th quarter of the
sequence), length, count, and t-value (standard deviations above the mean;
to accommodate the multiple tests performed, the t-value significance
threshold is set to 4.0 for sequences up to 750 residues, to 4.5 for
sequences of length 750-1500 residues, and to 5.0 for longer sequences);
also indicated are residues comprising at least 10% of the cluster.
Positive charge clusters (cmin = 9/30 or 13/45 or 16/60): none
Negative charge clusters (cmin = 10/30 or 13/45 or 16/60): none
Mixed charge clusters (cmin = 15/30 or 20/45 or 25/60): none
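The window criterion described above (for example, at least 9 net positive charges in 30 residues) can be checked with a sliding window over the charge string. This Python sketch only finds the largest net count for one fixed window size; the function name `max_net_charge` is an assumption, and the t-value computation and cluster size optimization that SAPS performs are not reproduced.

```python
def max_net_charge(charges, window, sign="+"):
    """Largest net count of `sign` charges in any window of the given size.

    `charges` is a string over {'+', '-', '0'}; charges of the opposite
    sign count as -1. Returns (best_count, 1-based start of first best window).
    """
    other = "-" if sign == "+" else "+"
    score = [1 if c == sign else -1 if c == other else 0 for c in charges]
    window = min(window, len(score))
    current = sum(score[:window])
    best, best_start = current, 1
    for i in range(window, len(score)):
        # Slide the window one position to the right.
        current += score[i] - score[i - window]
        if current > best:
            best, best_start = current, i - window + 2
    return best, best_start
```

A candidate positive cluster for a 30-residue window would then require `max_net_charge(charges, 30, "+")[0] >= 9` under the cmin values printed for this sequence.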
B. HIGH SCORING (UN)CHARGED SEGMENTS.
For each scoring scheme (scores assigned to residues as displayed), SAPS
displays segments of the sequence with aggregate score exceeding the par-
ticular threshold values M_0.01 (1% significance level, segments labeled
with **), M_0.05 (5% significance level, segments labeled *), or other-
wise as indicated. A minimal segment length is set as shown. The expected
score/letter should be sufficiently large negative, and the average infor-
mation per letter should be sufficiently large positive in order for the
scoring statistics to apply properly (the program prints out when the con-
ditions are not met and skips evaluations).
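A high scoring segment is a contiguous stretch whose summed residue scores are maximal. The Python sketch below finds the single best segment with Kadane's algorithm; it is an illustration under stated assumptions, not the SAPS code. The names `best_segment` and `POSITIVE_CHARGE` are hypothetical, and SAPS additionally applies the minimal length and the significance thresholds described above.

```python
def best_segment(seq, scores, default=0.0):
    """Highest-scoring contiguous segment (Kadane's algorithm).

    Returns (score, start, end) with 1-based inclusive coordinates.
    """
    best = (float("-inf"), 0, 0)
    running, start = 0.0, 1
    for pos, res in enumerate(seq.upper(), start=1):
        s = scores.get(res, default)
        if running <= 0:
            # A non-positive prefix never helps: restart here.
            running, start = s, pos
        else:
            running += s
        if running > best[0]:
            best = (running, start, pos)
    return best

# Scoring scheme for positive charge segments, as printed by SAPS.
POSITIVE_CHARGE = {**{r: 2.0 for r in "KR"},
                   **{r: -2.0 for r in "ED"},
                   **{r: -1.0 for r in "LAGSVTIPNFQYHMCW"}}
```

On residues 81-90 (DDKRKDIKHT) the best segment is KRK with score 6, far below the printed thresholds, which is in line with no segment being reported.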
______________________________________
High scoring positive charge segments:
score= 2.00 frequency= 0.107 ( KR )
score= 0.00 frequency= 0.000 ( BZX )
score= -1.00 frequency= 0.783 ( LAGSVTIPNFQYHMCW )
score= -2.00 frequency= 0.110 ( ED )
Expected score/letter: -0.788; Average information/letter: 1.306
Minimal length of displayed segments set to: 20
M_0.01= 10.20 (cv= 6.38, lambda= 0.91533, k= 0.32927, x= 3.81;
90% confidence interval for segment length: 10 +- 9)
M_0.05= 8.42 (x= 2.03)
# of segments (>=20 residues) exceeding M_0.05: none
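The statistics printed for this scoring scheme can be cross-checked numerically. The Python sketch below rests on an inference from the printed numbers, not on the SAPS source: lambda appears to be the positive root of sum_i f_i * exp(lambda * s_i) = 1, cv appears to equal ln(N)/lambda, and x appears to solve 1 - exp(-k * exp(-lambda * x)) = p, so that M_p = cv + x. The function names are assumptions, and k is taken from the output rather than computed.

```python
import math

def expected_score(scheme):
    """scheme: list of (score, frequency) pairs."""
    return sum(s * f for s, f in scheme)

def solve_lambda(scheme, lo=1e-6, hi=10.0, tol=1e-10):
    """Positive root of sum_i f_i * exp(lambda * s_i) = 1, by bisection."""
    def g(lam):
        return sum(f * math.exp(lam * s) for s, f in scheme) - 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def threshold(n, lam, k, alpha):
    """Return (cv, x, M) for sequence length n and significance level alpha."""
    cv = math.log(n) / lam
    x = math.log(k / -math.log(1.0 - alpha)) / lam
    return cv, x, cv + x
```

With the exact frequencies 37/345 (KR), 270/345 (uncharged), and 38/345 (ED), `expected_score` gives about -0.788 and `solve_lambda` about 0.915; `threshold(345, 0.91533, 0.32927, 0.01)` gives values close to the printed cv= 6.38, x= 3.81, and M_0.01= 10.20.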
______________________________________
High scoring negative charge segments:
score= 2.00 frequency= 0.110 ( ED )
score= 0.00 frequency= 0.000 ( BZX )
score= -1.00 frequency= 0.783 ( LAGSVTIPNFQYHMCW )
score= -2.00 frequency= 0.107 ( KR )
Expected score/letter: -0.777; Average information/letter: 1.259
Minimal length of displayed segments set to: 20
M_0.01= 10.37 (cv= 6.51, lambda= 0.89769, k= 0.32294, x= 3.87;
90% confidence interval for segment length: 11 +- 10)
M_0.05= 8.56 (x= 2.05)
# of segments (>=20 residues) exceeding M_0.05: none
___________________________________
High scoring mixed charge segments:
score= 1.00 frequency= 0.217 ( KEDR )
score= 0.00 frequency= 0.000 ( BZX )
score= -1.00 frequency= 0.783 ( LAGSVTIPNFQYHMCW )
Expected score/letter: -0.565; Average information/letter: 1.045
Minimal length of displayed segments set to: 20
M_0.01= 7.45 (cv= 4.56, lambda= 1.28093, k= 0.40821, x= 2.89;
90% confidence interval for segment length: 13 +- 10)
M_0.05= 6.18 (x= 1.62)
# of segments (>=20 residues) exceeding M_0.05: none
________________________________
High scoring uncharged segments:
score= 1.00 frequency= 0.783 ( LAGSVTIPNFQYHMCW )
score= 0.00 frequency= 0.000 ( BZX )
score= -8.00 frequency= 0.217 ( KEDR )
Expected score/letter: -0.957; Average information/letter: 0.175
Minimal length of displayed segments set to: 20
M_0.01= 41.11 (cv= 29.16, lambda= 0.20038, k= 0.11007, x= 11.94;
90% confidence interval for segment length: 68 +- 49)
M_0.05= 32.97 (x= 3.81)
# of segments (>=20 residues) exceeding M_0.05: none
C. CHARGE RUNS AND PATTERNS.
The table below shows the charge runs and patterns searched for (* stands
for + or -) and the required minimum number of matches to the pattern
allowing for at most 0 (lmin0), 1 (lmin1), or 2 (lmin2) mismatches or
insertions/deletions (1% significance level). Occurrences are arranged in
the order in which they appear in the sequence. For each run or pattern
are displayed its length (number of matches) and a triplet giving the
number of mismatches, insertions and deletions. 0-runs are further charac-
terized by their composition (residues comprising more than 10% of the
run).
Run count statistics are compiled for runs of lengths at least 2/3 of
the minimal significant length (lmin0); given are the number and locations
of such runs.
pattern (+)| (-)| (*)| (0)| (+0)| (-0)| (*0)|(+00)|(-00)|(*00)| (H.)|(H..)|
lmin0 5 | 5 | 7 | 36 | 9 | 9 | 12 | 11 | 11 | 15 | 6 | 8 |
lmin1 6 | 6 | 8 | 44 | 11 | 11 | 15 | 13 | 14 | 18 | 8 | 10 |
lmin2 7 | 7 | 10 | 49 | 13 | 13 | 17 | 15 | 15 | 20 | 9 | 11 |
(Significance level: 0.010000; Minimal displayed length: 6)
There are no charge runs or patterns exceeding the given minimal lengths.
Run count statistics:
+ runs >= 3: 1, at 83;
- runs >= 3: 0
* runs >= 4: 1, at 81;
0 runs >= 24: 0
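The run count statistics can be reproduced from the charge string with a regular expression. This is a minimal Python sketch of exact (error-free) runs only; the function name `charge_runs` and its `offset` parameter are assumptions, and the handling of mismatches and insertions/deletions is not covered.

```python
import re

def charge_runs(charges, symbol, min_len, offset=1):
    """List (start, length) of runs of `symbol` at least `min_len` long.

    `symbol` is '+', '-', '0', or '*' (any charge); `offset` is the
    sequence position of charges[0].
    """
    cls = "[+-]" if symbol == "*" else re.escape(symbol)
    pattern = re.compile("(?:%s){%d,}" % (cls, min_len))
    return [(m.start() + offset, len(m.group()))
            for m in pattern.finditer(charges)]
```

For positions 71-90 of the charge printout (`0000000-00--+++-0+00`), this reports a + run of length 3 at position 83 and a * run of length 6 at position 81, in agreement with the statistics above.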
--------------------------------------------------------------------------------
DISTRIBUTION OF OTHER AMINO ACID TYPES
Routinely, SAPS indicates high scoring hydrophobic and transmembrane
segments. The display is as described above for high scoring charge segments.
The scores for the hydrophobic segments correspond to a digitized hydro-
pathy scale. The transmembrane scores were derived from target frequen-
cies in putative transmembrane proteins (see the paper referred to above;
note, however, that the scores used in the program have been rederived and
differ from the ones given in the paper). With the -a command line flag,
the user can invoke a similar analysis for other residue types. In view
of the special role of cysteines for protein structure, the spacings of
the cysteine residues in the sequence are displayed separately, with par-
ticular emphasis on close pairs of cysteines and distances between such
pairs.
1. HIGH SCORING SEGMENTS.
__________________________________
High scoring hydrophobic segments:
2.00 (LVIFM) 1.00 (AGYCW) 0.00 (BZX) -2.00 (PH) -4.00 (STNQ)
-8.00 (KEDR)
Expected score/letter: -1.751; Average information/letter: 0.499
Minimal length of displayed segments set to: 15
M_0.01= 27.57 (cv= 17.87, lambda= 0.32695, k= 0.23899, x= 9.69;
90% confidence interval for segment length: 26 +- 17)
M_0.05= 22.58 (x= 4.71)
# of segments (>=15 residues) exceeding M_0.05: none
____________________________________
High scoring transmembrane segments:
5.00 (LVIF) 2.00 (AGM) 0.00 (BZX) -1.00 (YCW) -2.00 (ST)
-6.00 (P) -8.00 (H) -10.00 (NQ) -16.00 (KR) -17.00 (ED)
Expected score/letter: -3.154; Average information/letter: 0.390
Minimal length of displayed segments set to: 15
M_0.01= 66.94 (cv= 45.01, lambda= 0.12982, k= 0.17305, x= 21.92;
90% confidence interval for segment length: 32 +- 23)
M_0.05= 54.38 (x= 9.37); M_0.30= 39.44 (x= -5.57)
# of segments (>=15 residues) exceeding M_0.30: none
2. SPACINGS OF C.
H2N-145-C-62-C-136-COOH
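The spacing line lists the number of residues before the first cysteine, between consecutive cysteines, and after the last one. A minimal Python sketch of this layout follows; the function name `cysteine_spacings` is an assumption, and the output chosen for a sequence without cysteines is a guess rather than documented SAPS behavior.

```python
def cysteine_spacings(seq, residue="C"):
    """Format spacings as H2N-<gap>-C-<gap>-C-...-<gap>-COOH."""
    positions = [i for i, r in enumerate(seq.upper(), start=1) if r == residue]
    if not positions:
        # Assumed fallback: report the full length as a single gap.
        return "H2N-%d-COOH" % len(seq)
    gaps = [positions[0] - 1]
    gaps += [b - a - 1 for a, b in zip(positions, positions[1:])]
    gaps.append(len(seq) - positions[-1])
    parts = ["H2N", str(gaps[0])]
    for gap in gaps[1:]:
        parts += [residue, str(gap)]
    parts.append("COOH")
    return "-".join(parts)
```

With cysteines at positions 146 and 209 of a 345-residue sequence, the gaps are 145, 62, and 136, as printed.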
--------------------------------------------------------------------------------
REPETITIVE STRUCTURES.
Repeats are indicated for two alphabets: the 20-letter amino acid alpha-
bet, and a reduced 11-letter alphabet in which the major hydrophobics
LVIF, the charged residues KR and ED, the small residues AG, the hydroxyl
group residues ST, the amide group residues NQ, and the aromatics YW are
treated as combined letters. For each alphabet, three classes of repeats
are distinguished: separated repeats, simple tandem repeats, and periodic
repeats. The separated repeats are largely non-overlapping. They are
displayed in groups of matching blocks (exceeding a given core block
length of contiguous exact matches) and intervening spacer distances
(which may be negative, signifying a partial overlap). The core block
length in case of the amino acid alphabet is set to 4 for sequences up to
500 residues, to 5 for sequences between 500 and 2000 residues, and to 6
for longer sequences (same values increased by 4 for the reduced alpha-
bet). Simple tandem repeats are displayed in similar layout, but
separately. Sequence segments that are highly repetitive with relatively
short repeats are displayed as periodic repeats.
A. SEPARATED, TANDEM, AND PERIODIC REPEATS: amino acid alphabet.
Repeat core block length: 4
Aligned matching blocks:
[ 172- 175] AALT
[ 186- 190] AALTK
[ 193- 197] AALTK
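Matching blocks of this kind start from exact core matches of the stated length (4 here). The Python sketch below indexes every 4-residue word and keeps those that occur more than once; it is an illustration only. The function name `repeated_kmers` is an assumption, and the extension of cores into longer aligned blocks (such as AALTK) is not implemented.

```python
from collections import defaultdict

def repeated_kmers(seq, k=4, offset=1):
    """Map each k-residue word occurring more than once to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + offset)
    return {kmer: pos for kmer, pos in index.items() if len(pos) > 1}
```

On residues 171-197 (KAALTDQVALGKEIIAALTKARAALTK) with `offset=171`, the word AALT is found at 172, 186, and 193, the same start positions as the blocks listed above.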
B. SEPARATED AND TANDEM REPEATS: 11-letter reduced alphabet.
(i= LVIF; += KR; -= ED; s= AG; o= ST; n= NQ; a= YW; p= P; h= H; m= M; c= C)
Repeat core block length: 8
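The reduced alphabet is a many-to-one relabeling of the 20 amino acids using the letter key printed above. A minimal Python sketch follows; the names `REDUCED` and `reduce_sequence` are assumptions, and the placeholder `x` for residues outside the 20 standard letters is a choice made here, not SAPS behavior.

```python
# Letter key as printed: i= LVIF; += KR; -= ED; s= AG; o= ST; n= NQ;
# a= YW; p= P; h= H; m= M; c= C
REDUCED = {}
for letter, members in [("i", "LVIF"), ("+", "KR"), ("-", "ED"), ("s", "AG"),
                        ("o", "ST"), ("n", "NQ"), ("a", "YW"), ("p", "P"),
                        ("h", "H"), ("m", "M"), ("c", "C")]:
    for res in members:
        REDUCED[res] = letter

def reduce_sequence(seq):
    """Translate a protein sequence into the 11-letter reduced alphabet."""
    return "".join(REDUCED.get(res, "x") for res in seq.upper())
```

Repeats in the reduced alphabet could then be searched on the translated string with the longer core length (8) stated above.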
--------------------------------------------------------------------------------
MULTIPLETS.
Multiplets refer to homooligopeptides of any length (e.g., A2, Q7, etc.);
altplets refer to reiterations of two different residues (e.g., RG,
EAEAEA, etc.). The multiplet composition of the protein sequence is
evaluated for both the amino acid and the charge alphabet. (High) Aggregate
altplet counts are evaluated only for the charge alphabet. The multiplet
sequence is displayed whenever the total multiplet count of the
sequence falls outside the expected range (i.e., beyond 3 standard devia-
tions of the mean). Printed are also the histogram of the spacings between
consecutive multiplets (differences between starting positions) as well as
clusters of multiplets (multiplet clusters are determined in the same way
as charge clusters are determined; the binomial test is applied to a
compressed sequence over the alphabet {M,S}, where M signifies a multiplet
and S signifies a singlet; i.e., the amino acid sequence AADFFFGHRRT... is
translated as MSMSSMS..., and the binomial cluster test is applied to the
latter sequence). Multiplets and altplets of specific residue content that
individually show an unusually high count are indicated, and the positions
of all multiplets exceeding a minimum length of 5 residues are shown.
A. AMINO ACID ALPHABET.
1. Total number of amino acid multiplets: 28 (Expected range: 8-- 37)
2. Histogram of spacings between consecutive amino acid multiplets:
(1-5) 9 (6-10) 10 (11-20) 7 (>=21) 3
3. Clusters of amino acid multiplets (cmin = 13/30 or 17/45 or 20/60): none
B. CHARGE ALPHABET.
1. Total number of charge multiplets: 4 (Expected range: 0-- 16)
2 +plets (f+: 10.7%), 2 -plets (f-: 11.0%)
Total number of charge altplets: 11 (Critical number: 18)
2. Histogram of spacings between consecutive charge multiplets:
(1-5) 1 (6-10) 0 (11-20) 0 (>=21) 4
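The multiplet bookkeeping described in this section can be sketched in Python with `itertools.groupby`. The function names `multiplet_mask` and `multiplets` are assumptions for illustration; the expected-range and cluster statistics are not computed.

```python
from itertools import groupby

def multiplet_mask(seq):
    """'M' for each run of 2 or more identical residues, 'S' for each singlet."""
    return "".join("M" if len(list(run)) > 1 else "S"
                   for _, run in groupby(seq))

def multiplets(seq, offset=1):
    """List (start, residue, length) for every homooligopeptide of length >= 2."""
    found, pos = [], offset
    for res, run in groupby(seq):
        n = len(list(run))
        if n > 1:
            found.append((pos, res, n))
        pos += n
    return found
```

The example from the description, AADFFFGHRRT, is translated to MSMSSMS. Spacings between consecutive multiplets follow from the differences of the start positions returned by `multiplets`.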
--------------------------------------------------------------------------------
PERIODICITY ANALYSIS.
The program identifies periodic elements of periods between 1 and 10 for
the amino acid alphabet, for the charge alphabet, and for a hydrophobicity
alphabet. Each periodic element consists of an error-free core pattern (of
length at least 4 for the amino acid alphabet, 5 for the charge alphabet,
and 6 for the hydrophobicity alphabet) which is extended allowing for
errors. The numbers of errors are given for each position in the con-
sensus of a periodic pattern involving more than one letter. The displayed
periodic patterns would generally not be statistically significant but are
listed for the sake of a general interactive appraisal of the sequence.
Periodicities of exceptionally high copy number are indicated with a !-
mark.
A. AMINO ACID ALPHABET (core: 4; !-core: 5)
Location Period Element Copies Core Errors
131- 145 3 A.. 5 5 ! 0
172- 199 7 AALT... 4 4 /0/1/1/1/./././
B. CHARGE ALPHABET ({+= KR; -= ED; 0}; core: 5; !-core: 6)
and HYDROPHOBICITY ALPHABET ({*= KRED; i= LVIF; 0}; core: 6; !-core: 9)
Location Period Element Copies Core Errors
81- 86 1 * 6 6 0
283- 312 5 i.*.. 6 6 /0/./2/././
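An error-free periodic core such as `A..` (alanine at every third position) can be located by following each residue in steps of the period. The Python sketch below handles single-residue elements only and does no error-tolerant extension; the function name `periodic_elements` and its parameters are assumptions, not the SAPS procedure.

```python
def periodic_elements(seq, period, min_copies=4, offset=1):
    """Error-free single-residue periodic cores as (residue, start, copies)."""
    found = []
    for i, res in enumerate(seq):
        # Skip positions that continue a chain started earlier.
        if i >= period and seq[i - period] == res:
            continue
        copies, j = 1, i + period
        while j < len(seq) and seq[j] == res:
            copies += 1
            j += period
        if copies >= min_copies:
            found.append((res, i + offset, copies))
    return found
```

On residues 131-145 (AMAADGAVLAEIAGL) with period 3, this returns alanine starting at 131 with 5 copies, matching the first entry of the amino acid table.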
--------------------------------------------------------------------------------
SPACING ANALYSIS.
The spacings between consecutive residues of the same type (all 20 amino
acids, + and - charge, and combined charge *) are evaluated for signifi-
cantly large or small maximal and minimal spacings. The output is ordered
by the beginning point of the significant spacing. Entries are identified
by the residue type, spacing (number of amino acids between the identified
positions), rank of the displayed spacing (e.g., 50 alanines in the
sequence induce 51 spacings, ranked by decreasing length from 1 to 51),
and p-value (probability of exceeding the displayed spacing). A maximal
spacing with p-value 0.01 or less is considered significantly large; a
maximal spacing with p-value 0.99 or larger is considered significantly
small. Similarly, a minimal spacing with p-value 0.99 or larger is con-
sidered significantly small, and a minimal spacing with p-value 0.01 or
less is considered significantly large (excluding doublets). If the first
maximal spacing (rank 1) of a residue is significantly large or small,
then also the second maximal spacing (rank 2) is evaluated. Large maximal
and small minimal spacings indicate clustering effects, whereas small max-
imal and large minimal spacings indicate excessive evenness in the distri-
bution of the residues.
There are no unusual spacings.
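The spacings themselves are simple to enumerate. This Python sketch counts the residues between consecutive occurrences of a residue type and, by assumption, includes the stretches before the first and after the last occurrence, which is what makes n occurrences yield n + 1 spacings as in the description. The function names are hypothetical, and the p-values are not computed.

```python
def spacings(seq, residues, include_ends=True):
    """Residues between consecutive occurrences of any of `residues`."""
    positions = [i for i, r in enumerate(seq.upper(), start=1) if r in residues]
    if include_ends:
        # Treat the termini as virtual occurrences at 0 and len(seq) + 1.
        positions = [0] + positions + [len(seq) + 1]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def ranked(gaps):
    """Spacings ordered by decreasing length (rank 1 first)."""
    return sorted(gaps, reverse=True)
```

Passing `"KR"`, `"ED"`, or `"KRED"` as `residues` gives the +, -, and combined charge spacings.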
SAPS output for T7 10B (frameshifted C-terminus relative to 10A)
SAPS. Version of April 11, 1996.
Date run: Sat May 4 22:15:12 2002
SAPS (Statistical Analysis of Protein Sequences) evaluates by statistical
criteria a wide variety of protein sequence properties. A full description
of the methods is given in the paper referred to below. The output is or-
ganized in the following sections: file name, sequence printout, composi-
tional analysis, charge distributional analysis (charge clusters; high
scoring (un)charged segments; charge runs and patterns), distribution of
other amino acid types (high scoring hydrophobic and transmembrane
segments; cysteine spacings), repetitive structures (in the amino acid
alphabet and in an 11-letter reduced alphabet), multiplets (counts, spacings,
and clusters in the amino acid and charge alphabets), periodicity
analysis, spacing analysis. Each section is annotated below under its sec-
tion title.
The SAPS program was developed in the group of Prof. Samuel Karlin at
Stanford University. Correspondence relating to SAPS should be addressed
to either Volker Brendel or Samuel Karlin at the Department of Mathemat-
ics, Stanford University, Stanford CA 94305, U.S.A.; phone: (415) 723-
2209; fax: (415) 725-2040; email: volker@gnomic.stanford.edu. Users of the
program should cite the following reference: Brendel, V., Bucher, P.,
Nourbakhsh, I., Blaisdell, B.E., Karlin, S. (1992) Methods and algorithms
for statistical analysis of protein sequences. Proc. Natl. Acad. Sci. USA
89: 2002-2006.
********************************************************************************
Protein 1 (File: wwwtmp/.SAPS.12394.6118.seq)
SWISS-PROT ANNOTATION:
ID T710B
DE T710B, 398 bases, 772BEF00 checksum.
number of residues: 398; molecular weight: 41.8 kdal
1 MASMTGGQQM GTNQGKGVVA AGDKLALFLK VFGGEVLTAF ARTSVTTSRH MVRSISSGKS
61 AQFPVLGRTQ AAYLAPGENL DDKRKDIKHT EKVITIDGLL TADVLIYDIE DAMNHYDVRS
121 EYTSQLGESL AMAADGAVLA EIAGLCNVES KYNENIEGLG TATVIETTQN KAALTDQVAL
181 GKEIIAALTK ARAALTKNYV PAADRVFYCD PDSYSAILAA LMPNAANYAA LIDPEKGSIR
241 NVMGFEVVEV PHLTAGGAGT AREGTTGQKH VFPANKGEGN VKVAKDNVIG LFMHRSAVGT
301 VKLRDLALER ARRANFQADQ IIAKYAMGHG GLRPEAAGAV VFQSGVMLGV ASTVAASPEE
361 ASVTSTEETL TPAQEAARTR AANKARKEAE LAAATAEQ
--------------------------------------------------------------------------------
COMPOSITIONAL ANALYSIS (extremes relative to: swp23s.q)
The composition of the input sequence is evaluated relative to the residue
usage quantile table specified with the `-s species' flag. Low usage in
the 1% quantile is indicated by the label `--' (e.g., Y-- means that the
input sequence uses tyrosine as little as the 1% least tyrosine contain-
ing proteins in the reference set); low usage in the 5% quantile is indi-
cated by the label `-' (e.g., L-); high usage above the 95% quantile
point is indicated by the label `+' (e.g., A+); and high usage above the
99% quantile point is indicated by the label `++' (e.g., LIVFM++). The
usage is evaluated for all 20 amino acids, positive (KR) and negative (ED)
charge, total charge (KRED), net charge (KR-ED), major hydrophobics
(LVIFM), and the groupings ST, AGP (encoded by CCN, GCN, and GGN codons),
and FIKMNY (encoded by AAN, AUN, UAN, and UUN codons).
A++: 66(16.6%); C : 2( 0.5%); D : 18( 4.5%); E : 27( 6.8%); F : 10( 2.5%)
G : 35( 8.8%); H : 7( 1.8%); I : 17( 4.3%); K : 22( 5.5%); L : 31( 7.8%)
M : 11( 2.8%); N : 16( 4.0%); P : 11( 2.8%); Q : 14( 3.5%); R : 19( 4.8%)
S : 20( 5.0%); T : 30( 7.5%); V : 32( 8.0%); W : 0( 0.0%); Y : 10( 2.5%)
KR : 41 ( 10.3%); ED : 45 ( 11.3%); AGP : 112 ( 28.1%);
KRED : 86 ( 21.6%); KR-ED : -4 ( -1.0%); FIKMNY : 86 ( 21.6%);
LVIFM : 101 ( 25.4%); ST : 50 ( 12.6%).
--------------------------------------------------------------------------------
CHARGE DISTRIBUTIONAL ANALYSIS
The distribution of charges in the protein sequence is evaluated in terms
of clusters, high scoring segments, and runs and periodic patterns. Clus-
ters indicate regions of typically 30 to 60 residues exhibiting a rela-
tively high charge concentration. For high scoring charge segments, posi-
tive scores are assigned to charge residues of the appropriate type and
negative scores to all other residues. A significant cumulative positive
score again indicates a region of high charge concentration. The cluster
method and the scoring method will generally pick out the same segments
(with the scoring method often delimiting the segment to a narrower
range), conferring robustness to the results. Short segments of high
charge concentration are displayed as runs (with errors). Periodic pat-
terns focus on those with charges every second or third position, with
possible relevance to amphipathic secondary structures; other periodic
patterns are displayed in the general periodicity analysis section of the
output.
1 0000000000 00000+0000 00-+00000+ 0000-00000 0+000000+0 00+00000+0
61 0000000+00 0000000-00 --+++-0+00 -+0000-000 00-0000-0- -00000-0+0
121 -000000-00 0000-00000 -0000000-0 +00-00-000 00000-0000 +0000-0000
181 0+-000000+ 0+0000+000 000-+0000- 0-00000000 0000000000 00-0-+000+
241 00000-00-0 0000000000 0+-00000+0 00000+0-00 0+00+-0000 0000+00000
301 0+0+-000-+ 0++00000-0 000+000000 00+0-00000 0000000000 00000000--
361 000000--00 0000-00+0+ 000+0++-0- 000000-0
A. CHARGE CLUSTERS.
Positive, negative, and mixed charge clusters are distinguished. In each
case, cmin indicates the minimum number of charges required for a signifi-
cant charge cluster corresponding to the given window size; e.g., cmin =
9/30 or 12/45 or 15/60 means that significance requires at least 9 charges
in a segment of 30 (or fewer) residues, or 12 charges in a segment of
length 45, or 15 charges in a segment of length 60. In the case of posi-
tive and negative charge clusters, these counts refer to net charge, i.e.,
charges of the opposite sign within the window are counted as -1. The
sizes of the clusters are optimized for display to indicate the segment of
highest charge concentration, but a minimum size of 20 residues is
required. A mixed charge cluster that begins and ends within 15 residues
of the endpoints of a pure charge cluster is not displayed (since its sig-
nificance rests mostly on the charged residues comprising the displayed
pure charge cluster), unless the -v (verbose output) flag is set, in which
case both the pure and the mixed charge cluster are displayed. On the
other hand, pure charge clusters that are embedded in mixed charge clus-
ters are displayed separately (indicated by a * preceding the specifica-
tion of location).
For each cluster are given its location in the sequence (From, to),
the quartile of the location (1st, 2nd, 3rd, or 4th quarter of the
sequence), length, count, and t-value (standard deviations above the mean;
to accommodate the multiple tests performed, the t-value significance
threshold is set to 4.0 for sequences up to 750 residues, to 4.5 for
sequences of length 750-1500 residues, and to 5.0 for longer sequences);
also indicated are residues comprising at least 10% of the cluster.
Positive charge clusters (cmin = 9/30 or 12/45 or 15/60): none
Negative charge clusters (cmin = 10/30 or 13/45 or 16/60): none
Mixed charge clusters (cmin = 15/30 or 20/45 or 25/60): none
B. HIGH SCORING (UN)CHARGED SEGMENTS.
For each scoring scheme (scores assigned to residues as displayed), SAPS
displays segments of the sequence with aggregate score exceeding the par-
ticular threshold values M_0.01 (1% significance level, segments labeled
with **), M_0.05 (5% significance level, segments labeled *), or other-
wise as indicated. A minimal segment length is set as shown. The expected
score/letter should be sufficiently large negative, and the average infor-
mation per letter should be sufficiently large positive in order for the
scoring statistics to apply properly (the program prints out when the con-
ditions are not met and skips evaluations).
______________________________________
High scoring positive charge segments:
score= 2.00 frequency= 0.103 ( KR )
score= 0.00 frequency= 0.000 ( BZX )
score= -1.00 frequency= 0.784 ( LAGSVTIPNFQYHMCW )
score= -2.00 frequency= 0.113 ( ED )
Expected score/letter: -0.804; Average information/letter: 1.377
Minimal length of displayed segments set to: 20
M_0.01= 10.09 (cv= 6.36, lambda= 0.94140, k= 0.33816, x= 3.73;
90% confidence interval for segment length: 10 +- 9)
M_0.05= 8.36 (x= 2.00)
# of segments (>=20 residues) exceeding M_0.05: none
______________________________________
High scoring negative charge segments:
score= 2.00 frequency= 0.113 ( ED )
score= 0.00 frequency= 0.000 ( BZX )
score= -1.00 frequency= 0.784 ( LAGSVTIPNFQYHMCW )
score= -2.00 frequency= 0.103 ( KR )
Expected score/letter: -0.764; Average information/letter: 1.211
Minimal length of displayed segments set to: 20
M_0.01= 10.72 (cv= 6.80, lambda= 0.87993, k= 0.31623, x= 3.92;
90% confidence interval for segment length: 11 +- 10)
M_0.05= 8.87 (x= 2.07)
# of segments (>=20 residues) exceeding M_0.05: none
___________________________________
High scoring mixed charge segments:
score= 1.00 frequency= 0.216 ( KEDR )
score= 0.00 frequency= 0.000 ( BZX )
score= -1.00 frequency= 0.784 ( LAGSVTIPNFQYHMCW )
Expected score/letter: -0.568; Average information/letter: 1.056
Minimal length of displayed segments set to: 20
M_0.01= 7.53 (cv= 4.65, lambda= 1.28866, k= 0.41132, x= 2.88;
90% confidence interval for segment length: 13 +- 10)
M_0.05= 6.26 (x= 1.62)
# of segments (>=20 residues) exceeding M_0.05: none
________________________________
High scoring uncharged segments:
score= 1.00 frequency= 0.784 ( LAGSVTIPNFQYHMCW )
score= 0.00 frequency= 0.000 ( BZX )
score= -8.00 frequency= 0.216 ( KEDR )
Expected score/letter: -0.945; Average information/letter: 0.172
Minimal length of displayed segments set to: 20
M_0.01= 42.20 (cv= 30.21, lambda= 0.19817, k= 0.10814, x= 11.99;
90% confidence interval for segment length: 70 +- 51)
M_0.05= 33.97 (x= 3.76)
# of segments (>=20 residues) exceeding M_0.05: none
C. CHARGE RUNS AND PATTERNS.
The table below shows the charge runs and patterns searched for (* stands
for + or -) and the required minimum number of matches to the pattern
allowing for at most 0 (lmin0), 1 (lmin1), or 2 (lmin2) mismatches or
insertions/deletions (1% significance level). Occurrences are arranged in
the order in which they appear in the sequence. For each run or pattern
are displayed its length (number of matches) and a triplet giving the
number of mismatches, insertions and deletions. 0-runs are further charac-
terized by their composition (residues comprising more than 10% of the
run).
Run count statistics are compiled for runs of lengths at least 2/3 of
the minimal significant length (lmin0); given are the number and locations
of such runs.
pattern (+)| (-)| (*)| (0)| (+0)| (-0)| (*0)|(+00)|(-00)|(*00)| (H.)|(H..)|
lmin0 5 | 5 | 7 | 37 | 9 | 9 | 12 | 11 | 11 | 15 | 6 | 8 |
lmin1 6 | 6 | 8 | 45 | 11 | 11 | 15 | 13 | 14 | 18 | 8 | 10 |
lmin2 7 | 7 | 10 | 50 | 13 | 13 | 17 | 15 | 16 | 20 | 9 | 11 |
(Significance level: 0.010000; Minimal displayed length: 6)
There are no charge runs or patterns exceeding the given minimal lengths.
Run count statistics:
+ runs >= 3: 1, at 83;
- runs >= 3: 0
* runs >= 5: 1, at 81;
0 runs >= 25: 0
--------------------------------------------------------------------------------
DISTRIBUTION OF OTHER AMINO ACID TYPES
Routinely, SAPS indicates high scoring hydrophobic and transmembrane
segments. The display is as described above for high scoring charge segments.
The scores for the hydrophobic segments correspond to a digitized hydro-
pathy scale. The transmembrane scores were derived from target frequen-
cies in putative transmembrane proteins (see the paper referred to above;
note, however, that the scores used in the program have been rederived and
differ from the ones given in the paper). With the -a command line flag,
the user can invoke a similar analysis for other residue types. In view
of the special role of cysteines for protein structure, the spacings of
the cysteine residues in the sequence are displayed separately, with par-
ticular emphasis on close pairs of cysteines and distances between such
pairs.
1. HIGH SCORING SEGMENTS.
__________________________________
High scoring hydrophobic segments:
2.00 (LVIFM) 1.00 (AGYCW) 0.00 (BZX) -2.00 (PH) -4.00 (STNQ)
-8.00 (KEDR)
Expected score/letter: -1.832; Average information/letter: 0.547
Minimal length of displayed segments set to: 15
M_0.01= 26.30 (cv= 17.06, lambda= 0.35084, k= 0.25652, x= 9.23;
90% confidence interval for segment length: 24 +- 16)
M_0.05= 21.65 (x= 4.59)
# of segments (>=15 residues) exceeding M_0.05: none
____________________________________
High scoring transmembrane segments:
5.00 (LVIF) 2.00 (AGM) 0.00 (BZX) -1.00 (YCW) -2.00 (ST)
-6.00 (P) -8.00 (H) -10.00 (NQ) -16.00 (KR) -17.00 (ED)
Expected score/letter: -3.219; Average information/letter: 0.411
Minimal length of displayed segments set to: 15
M_0.01= 64.79 (cv= 43.73, lambda= 0.13690, k= 0.17954, x= 21.06;
90% confidence interval for segment length: 31 +- 22)
M_0.05= 52.88 (x= 9.15); M_0.30= 38.71 (x= -5.01)
1) From 336 to 356: length= 21, score=39.00
(pocket at 343 to 344: length= 2, score=-12.00)
336 AAGAVVF |QS| G VMLGVASTVA A
A: 6(28.6%); G: 3(14.3%); V: 5(23.8%);
# of segments (>=15 residues) exceeding M_0.30: 1
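The thresholds M_p printed above appear consistent with the usual
extreme-value approximation for maximal segment scores,
P(M > ln(N)/lambda + x) = 1 - exp(-k exp(-lambda x)), with cv = ln(N)/lambda
and M_p = cv + x. The sketch below checks this reading against the printed
numbers. Both the formula and the sequence length N = 398 are assumptions:
the output does not state the formula, and N is inferred from the cysteine
spacings reported in this output (145 + 1 + 62 + 1 + 189).

```python
import math


def threshold(p, n, lam, k):
    """Return (cv, x, M_p) under the assumed extreme-value approximation."""
    cv = math.log(n) / lam                        # centering value ln(N)/lambda
    x = -math.log(-math.log(1.0 - p) / k) / lam   # solves 1-exp(-k*exp(-lam*x)) = p
    return cv, x, cv + x


N = 398  # assumed: 145 + 1 + 62 + 1 + 189 from the cysteine spacings
for label, lam, k in [("hydrophobic", 0.35084, 0.25652),
                      ("transmembrane", 0.13690, 0.17954)]:
    for p in (0.01, 0.05):
        cv, x, m = threshold(p, N, lam, k)
        print(f"{label}: M_{p:.2f} = {m:.2f} (cv = {cv:.2f}, x = {x:.2f})")
```

Under these assumptions the computed values agree with the printed cv, x,
and M_p entries to two decimals.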
2. SPACINGS OF C.
H2N-145-C-62-C-189-COOH
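The spacing line above can be read as the number of residues before, between,
and after the cysteines. A minimal sketch that produces a line of the same
shape is given below; the function name residue_spacings is hypothetical, and
how SAPS prints adjacent cysteines (spacing 0) is not known from this output.

```python
def residue_spacings(seq, residue="C"):
    """Format the gaps around each occurrence of `residue`, SAPS-like style."""
    parts = ["H2N"]
    gap = 0
    for aa in seq:
        if aa == residue:
            parts += [str(gap), residue]  # close the current gap at this residue
            gap = 0
        else:
            gap += 1
    parts += [str(gap), "COOH"]           # trailing gap up to the C-terminus
    return "-".join(parts)
```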
--------------------------------------------------------------------------------
REPETITIVE STRUCTURES.
Repeats are indicated for two alphabets: the 20-letter amino acid alpha-
bet, and a reduced 11-letter alphabet in which the major hydrophobics
LVIF, the charged residues KR and ED, the small residues AG, the hydroxyl
group residues ST, the amide group residues NQ, and the aromatics YW are
treated as combined letters. For each alphabet, three classes of repeats
are distinguished: separated repeats, simple tandem repeats, and periodic
repeats. The separated repeats are largely non-overlapping. They are
displayed in groups of matching blocks (exceeding a given core block
length of contiguous exact matches) and intervening spacer distances
(which may be negative, signifying a partial overlap). The core block
length in case of the amino acid alphabet is set to 4 for sequences up to
500 residues, to 5 for sequences between 500 and 2000 residues, and to 6
for longer sequences (same values increased by 4 for the reduced alpha-
bet). Simple tandem repeats are displayed in similar layout, but
separately. Sequence segments that are highly repetitive with relatively
short repeats are displayed as periodic repeats.
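The two ingredients described above, the reduced alphabet and the search for
exactly matching core blocks, can be sketched as follows. The grouping follows
the legend printed in this output; the function names, the use of 'x' for
unmapped letters, and the simple dictionary lookup are assumptions and do not
reproduce the grouping, spacer, or tandem logic of SAPS.

```python
from collections import defaultdict

# Reduced 11-letter alphabet, following the legend printed in the output.
GROUPS = {"i": "LVIF", "+": "KR", "-": "ED", "s": "AG", "o": "ST",
          "n": "NQ", "a": "YW", "p": "P", "h": "H", "m": "M", "c": "C"}
REDUCE = {aa: sym for sym, group in GROUPS.items() for aa in group}


def reduce_sequence(seq):
    """Translate residues to the reduced alphabet ('x' for anything unmapped)."""
    return "".join(REDUCE.get(aa, "x") for aa in seq)


def repeated_blocks(seq, core=4):
    """Map each exact block of length `core` seen more than once
    to the list of its 1-based start positions."""
    starts = defaultdict(list)
    for i in range(len(seq) - core + 1):
        starts[seq[i:i + core]].append(i + 1)
    return {block: pos for block, pos in starts.items() if len(pos) > 1}
```

For the reduced alphabet, the same search would be run on
reduce_sequence(seq) with the larger core length (8 for this sequence).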
A. SEPARATED, TANDEM, AND PERIODIC REPEATS: amino acid alphabet.
Repeat core block length: 4
Aligned matching blocks:
[ 172- 175] AALT
[ 186- 190] AALTK
[ 193- 197] AALTK
B. SEPARATED AND TANDEM REPEATS: 11-letter reduced alphabet.
(i= LVIF; += KR; -= ED; s= AG; o= ST; n= NQ; a= YW; p= P; h= H; m= M; c= C)
Repeat core block length: 8
--------------------------------------------------------------------------------
MULTIPLETS.
Multiplets refer to homooligopeptides of any length (e.g., A2, Q7, etc.);
altplets refer to reiterations of two different residues (e.g., RG,
EAEAEA, etc.). The multiplet composition of the protein sequence is
evaluated for both the amino acid and the charge alphabet. (High) aggregate
altplet counts are evaluated only for the charge alphabet. The multiplet
sequence is displayed whenever the total multiplet count of the sequence
falls outside the expected range (i.e., beyond 3 standard deviations of
the mean). Also printed are the histogram of the spacings between
consecutive multiplets (differences between starting positions) as well as
clusters of multiplets (multiplet clusters are determined in the same way
as charge clusters are determined; the binomial test is applied to a
compressed sequence over the alphabet {M,S}, where M signifies a multiplet
and S signifies a singlet; i.e., the amino acid sequence AADFFFGHRRT... is
translated as MSMSSMS..., and the binomial cluster test is applied to the
latter sequence). Multiplets and altplets of specific residue content that
individually show an unusually high count are indicated, and the positions
of all multiplets exceeding a minimum length of 5 residues are shown.
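The compression to the {M,S} alphabet and the spacings between consecutive
multiplets can be sketched as below. The function names are hypothetical, and
the binomial cluster test itself is not reproduced here.

```python
from itertools import groupby


def runs(seq):
    """Yield (residue, start, length) for each maximal run of one residue."""
    pos = 1
    for aa, grp in groupby(seq):
        n = len(list(grp))
        yield aa, pos, n
        pos += n


def compress(seq):
    """Translate each run to M (multiplet, length >= 2) or S (singlet)."""
    return "".join("M" if n >= 2 else "S" for _, _, n in runs(seq))


def multiplet_spacings(seq):
    """Differences between start positions of consecutive multiplets."""
    starts = [start for _, start, n in runs(seq) if n >= 2]
    return [b - a for a, b in zip(starts, starts[1:])]
```

With the example from the text, compress("AADFFFGHRRT") gives "MSMSSMS".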
A. AMINO ACID ALPHABET.
1. Total number of amino acid multiplets: 34 (Expected range: 12-- 43)
2. Histogram of spacings between consecutive amino acid multiplets:
(1-5) 11 (6-10) 13 (11-20) 8 (>=21) 3
3. Clusters of amino acid multiplets (cmin = 13/30 or 17/45 or 21/60): none
B. CHARGE ALPHABET.
1. Total number of charge multiplets: 7 (Expected range: 0-- 17)
3 +plets (f+: 10.3%), 4 -plets (f-: 11.3%)
Total number of charge altplets: 12 (Critical number: 20)
2. Histogram of spacings between consecutive charge multiplets:
(1-5) 1 (6-10) 1 (11-20) 2 (>=21) 4
--------------------------------------------------------------------------------
PERIODICITY ANALYSIS.
The program identifies periodic elements of periods between 1 and 10 for
the amino acid alphabet, for the charge alphabet, and for a hydrophobicity
alphabet. Each periodic element consists of an error-free core pattern (of
length at least 4 for the amino acid alphabet, 5 for the charge alphabet,
and 6 for the hydrophobicity alphabet) which is extended allowing for
errors. The numbers of errors are given for each position in the con-
sensus of a periodic pattern involving more than one letter. The displayed
periodic patterns would generally not be statistically significant but are
listed for the sake of a general interactive appraisal of the sequence.
Periodicities of exceptionally high copy number are indicated with a !-
mark.
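A minimal sketch of the error-free core search for single-letter patterns
(such as A.. with period 3) is given below. It is an illustration under
simplifying assumptions: the function name is hypothetical, only one-letter
elements are handled, and the extension step that allows errors is omitted.

```python
def periodic_cores(seq, min_copies=4, max_period=10):
    """Find error-free single-letter cores: the same residue recurring
    every `period` positions at least `min_copies` times.

    Returns (start, period, residue, copies) tuples with a 1-based start.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        for i in range(n):
            # Skip positions that continue a run already started further left.
            if i >= period and seq[i - period] == seq[i]:
                continue
            copies, j = 1, i + period
            while j < n and seq[j] == seq[i]:
                copies, j = copies + 1, j + period
            if copies >= min_copies:
                hits.append((i + 1, period, seq[i], copies))
    return hits
```

The defaults mirror the amino acid alphabet setting (core length 4, periods
1 to 10); the charge and hydrophobicity alphabets would need the sequence
translated first and a larger min_copies.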
A. AMINO ACID ALPHABET (core: 4; !-core: 5)
Location Period Element Copies Core Errors
131- 145 3 A.. 5 5 ! 0
172- 199 7 AALT... 4 4 /0/1/1/1/./././
373- 396 4 A... 6 6 ! 0
B. CHARGE ALPHABET ({+= KR; -= ED; 0}; core: 5; !-core: 6)
and HYDROPHOBICITY ALPHABET ({*= KRED; i= LVIF; 0}; core: 6; !-core: 9)
Location Period Element Copies Core Errors
81- 86 1 * 6 6 0
283- 312 5 i.*.. 6 6 /0/./2/././
--------------------------------------------------------------------------------
SPACING ANALYSIS.
The spacings between consecutive residues of the same type (all 20 amino
acids, + and - charge, and combined charge *) are evaluated for signifi-
cantly large or small maximal and minimal spacings. The output is ordered
by the beginning point of the significant spacing. Entries are identified
by the residue type, spacing (number of amino acids between the identified
positions), rank of the displayed spacing (e.g., 50 alanines in the
sequence induce 51 spacings, ranked by decreasing length from 1 to 51),
and p-value (probability of exceeding the displayed spacing). A maximal
spacing with p-value 0.01 or less is considered significantly large; a
maximal spacing with p-value 0.99 or larger is considered significantly
small. Similarly, a minimal spacing with p-value 0.99 or larger is con-
sidered significantly small, and a minimal spacing with p-value 0.01 or
less is considered significantly large (excluding doublets). If the first
maximal spacing (rank 1) of a residue is significantly large or small,
then also the second maximal spacing (rank 2) is evaluated. Large maximal
and small minimal spacings indicate clustering effects, whereas small max-
imal and large minimal spacings indicate excessive evenness in the distri-
bution of the residues.
There are no unusual spacings.
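The spacings and their ranks, as described above, can be sketched as follows.
Treating the two sequence ends as boundaries is an assumption based on the
remark that 50 alanines induce 51 spacings; the function name is
hypothetical, and the p-value computation is not reproduced.

```python
def spacings(seq, residues):
    """Ranked spacings for the residue types in `residues`.

    Returns (gap, start, end) tuples sorted by decreasing gap, where gap is
    the number of residues strictly between consecutive occurrences. The
    sequence ends count as boundaries (positions 0 and len(seq) + 1), so
    n occurrences give n + 1 spacings; rank 1 is the first entry.
    """
    hits = [i for i, aa in enumerate(seq, start=1) if aa in residues]
    bounds = [0] + hits + [len(seq) + 1]
    gaps = [(b - a - 1, a, b) for a, b in zip(bounds, bounds[1:])]
    return sorted(gaps, key=lambda g: -g[0])
```

Passing "KR" or "ED" as `residues` gives the + and - charge spacings, and
"KRED" the combined charge spacings.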