This page describes the PSORT II variables that are potentially used by WoLF PSORT. However, because WoLF PSORT uses a feature selection process to discard features that are not clearly useful for prediction, not all of the feature variables described here are actually used for WoLF PSORT prediction. You may still obtain the values of these unused features by following the links to the PSORT II verbose output.
Feature Index:
a-l: act, alm, bac, caa, dna, erl, erm, gpi, gvh, leu
m: m1a, m1b, m2, mNt, m3a, m3b, m_, mip, mit, myr
n-z: nuc, pox, px2, psg, rib, rnp, tms, tyr, top, vac, yqr
Name | Description | ||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
psg | Score for the presence of signal peptides; the algorithm was basically developed by McGeoch, "On the predictive recognition of signal peptide sequences", Virus Res., 3(3):271-286. 1985. This feature was further expanded by Nakai and Kanehisa in 1992; published in "Refinement of the prediction methods of signal peptides for the genome analyses of Saccharomyces cerevisiae and Bacillus subtilis", Genome Informatics, 1996. | ||||||||||||||||||||||||||||||||||||||||||||||||
gvh | Original weight-matrix score of von Heijne, "A new method for predicting signal sequence cleavage sites", Nucleic Acids Res., 14:4683-90, 1986. Used for the recognition of cleavage sites of eukaryotic signal peptides. For a historical technical reason that is no longer relevant, the value actually computed (and displayed) has 7.5 subtracted from it. | ||||||||||||||||||||||||||||||||||||||||||||||||
alm | Discriminant score, calculated from the most hydrophobic 17 residue segment, given by the algorithm of Klein, Kanehisa, DeLisi, "The detection and classification of membrane-spanning proteins", Biochim Biophys Acta., 815(3):468-76, 1985. | ||||||||||||||||||||||||||||||||||||||||||||||||
top | Score for predicting the topology of membrane proteins, by Hartmann, Rapoport, Lodish's method, "Predicting the orientation of eukaryotic membrane-spanning proteins", Proc Natl Acad Sci USA., 86(15):5786-90, 1989. It is essentially the net charge difference of 15 residues flanking the most N-terminal transmembrane segment on both sides. | ||||||||||||||||||||||||||||||||||||||||||||||||
tms | Predicted number of transmembrane segments (except the cleavable signal peptide), given by Klein, Kanehisa, DeLisi's algorithm, "The detection and classification of membrane-spanning proteins", Biochim Biophys Acta., 815(3):468-76, 1985. | ||||||||||||||||||||||||||||||||||||||||||||||||
mit | Score for the presence of an N-terminal mitochondrial targeting signal, which is calculated from the amino acid composition of the N-terminal 20 residues. Developed by Nakai and Kanehisa, "A knowledge base for predicting protein localization sites in eukaryotic cells", Genomics, 14:897-911, 1992. | ||||||||||||||||||||||||||||||||||||||||||||||||
mip | Predicted position of the cleavage sites of mitochondrial targeting signals by Gavel and von Heijne's method, "Cleavage-site motifs in mitochondrial targeting peptides.", Protein Eng., 4(1):33-7, 1990. | ||||||||||||||||||||||||||||||||||||||||||||||||
nuc | Discriminant score for being a nuclear protein, calculated from the presence of an NLS motif, a bipartite motif, and the amino acid composition, by Nakai and Kanehisa, "A knowledge base for predicting protein localization sites in eukaryotic cells", Genomics, 14:897-911, 1992.
More specifically, this feature is a meta-feature consisting of the features described below. Several of the features check for the presence of either one of the basic residues 'R' or 'K'. Although this is not standard notation, for convenience I use 'B' here to denote a residue which is either 'R' or 'K'.
Using this feature alone, with a threshold of (nuc > 1), one obtains a (sensitivity, specificity) of (30%, 71%), (19%, 65%), and (21%, 77%) for animal, plant, and fungi respectively. | ||||||||||||||||||||||||||||||||||||||||||||||||
erl | "1" if the last four C-terminal residues are "KDEL" or "HDEL", 0 otherwise. As an E.R. protein detector, this feature has modest sensitivity (9-25%) but excellent specificity. In our current (2007/3) dataset, all of the proteins with a score of "1" are E.R. proteins. | ||||||||||||||||||||||||||||||||||||||||||||||||
erm | Score originally intended to detect ER membrane proteins, calculated from the presence of the retention signals and the membrane topology. PSORT II original feature, Nakai & Horton, "PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization", Trends in Biochemical Sciences, 24(1):34-5, 1999.
The actual definition of the value of "erm" is slightly complicated. It is the sum of two quantities, which we will call the N-score and the C-score here. Let the amino acid sequence be denoted as S1, S2, ..., Sn. The N-score is the number of arginine residues in the first four residues after skipping the most N-terminal residue (i.e. S2 S3 S4 S5), plus a bonus of 2 if those residues contain at least one arginine and the predicted membrane type (by the method of Klein et al., see the tms feature) is type "2" or type "Nt". The C-score is the number of lysine residues in the last four residues up to but not including the most C-terminal residue (i.e. S(n-4) S(n-3) S(n-2) S(n-1)), plus a bonus of 2 if those residues contain at least one lysine and the predicted membrane type (by the method of Klein et al., see the tms feature) is type "1a" or type "1b". Unfortunately, as defined, this feature does not appear effective for detecting E.R. membrane proteins. Slightly more than half of all proteins have a value of "0" for this feature. Upon visual inspection, the most apparent statistical trend is that 80-90% of cytoskeletal proteins have a value of "1", versus an average of 29-32% for all proteins. | ||||||||||||||||||||||||||||||||||||||||||||||||
pox | Developed by Kenta Nakai, a non-zero value for this feature indicates a possible PTS1 C-terminal
peroxisomal sorting signal.
The value depends on the most specific regular expression given in the table below which matches
the 3 C-terminal residues.
| ||||||||||||||||||||||||||||||||||||||||||||||||
px2 | Number of matches to the regular expression "[RK][LI]xxxxx[HQ]L" in any part of the sequence. This is intended to be a detector for PTS2 peroxisomal targeting signals (for a review see McNew & Goodman, "The targeting and assembly of peroxisomal proteins: some old rules do not apply", Trends Biochem Sci., 21(2):54-8, 1996). There seems to be a weak enrichment for this pattern in plant peroxisomal proteins. | ||||||||||||||||||||||||||||||||||||||||||||||||
vac | Number of matches to the regular expression "[TIK]LP[NKI]" in any part of the sequence. Intended to detect a possible vacuolar targeting signal, but does not show strong correlation with vacuolar localization in our current dataset (2007/03). | ||||||||||||||||||||||||||||||||||||||||||||||||
rnp | Number of matches to the regular expression "[RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]" in any part of the sequence. The pattern is taken from PROSITE, where it is an RNA-binding protein motif. It seems to detect a few chloroplast RNA-binding proteins. | ||||||||||||||||||||||||||||||||||||||||||||||||
act | Number of matches to either of the following two regular expressions:
| ||||||||||||||||||||||||||||||||||||||||||||||||
caa | Presence of the CaaX motif or related motifs at the C-terminus for predicting isoprenylated proteins
as suggested in a review by Zhang & Casey, Annu Rev Biochem., 65:241-69, 1996. The value of this feature is given by the first regular expression in the following table which matches the C-terminus.
| ||||||||||||||||||||||||||||||||||||||||||||||||
yqr | This feature is motivated by the tyrosine based trans-Golgi localization signal for
membrane proteins described by Humphrey et al.,
"Localization of TGN38 to the trans-Golgi network: involvement of a cytoplasmic tyrosine-containing sequence",
The Journal of Cell Biology, 120:1123-35, 1993.
The score of this feature depends on the number of times "YQRL" appears anywhere in the sequence and the number of predicted transmembrane spanning regions
(i.e. the feature tms).
The exact value is given in this table:
| ||||||||||||||||||||||||||||||||||||||||||||||||
tyr | This feature, designed by Kenta Nakai, is motivated by the fact that the presence of a
tyrosine-containing motif in the cytoplasmic tail of membrane proteins can be important for
selective inclusion in clathrin-coated vesicles (endocytosis) and lysosomal targeting.
The value of this feature is zero, except for proteins that are predicted by the tms feature to have exactly one membrane-spanning region and whose predicted cytoplasmic tail (i.e. the part of the sequence after the membrane-spanning region) is no more than 50 residues long. In that case the value is 10 times the fraction of tyrosines in the predicted cytoplasmic tail. Unfortunately, most proteins have a value of zero, and the proteins with non-zero values localize to a variety of sites. | ||||||||||||||||||||||||||||||||||||||||||||||||
leu | This feature, designed by Kenta Nakai, is motivated by the fact that the presence of
a dileucine motif "LL" in the cytoplasmic tail of membrane proteins can be important for
selective inclusion in clathrin-coated vesicles (endocytosis) and lysosomal targeting.
The value of this feature is zero, except for proteins that are predicted by the tms feature to have exactly one membrane-spanning region. In that case the value is 10 times the number of (non-overlapping) occurrences of "LL" in the predicted cytoplasmic tail (i.e. the part of the sequence after the membrane-spanning region). All but 6-7% of all proteins have a value of "0", but the percentage of Golgi, E.R. and peroxisomal proteins with non-zero values tends to be somewhat higher:
| ||||||||||||||||||||||||||||||||||||||||||||||||
gpi | This feature was designed by Kenta Nakai with the intention of predicting GPI-anchored proteins,
based on the empirical knowledge that most of them are type Ia membrane proteins with a very short
cytoplasmic tail (within 10 residues); see
Nakai and Kanehisa,
"A knowledge base for predicting protein localization sites in eukaryotic cells",
Genomics, 14:897-911, 1992.
The value of gpi is "1" if the protein is predicted to be a type Ia membrane protein with a cytoplasmic tail of less than 10 residues, otherwise "0". Unfortunately, only 2 out of more than 10,000 proteins in our current (2007/3) dataset have a non-zero value for this feature. | ||||||||||||||||||||||||||||||||||||||||||||||||
myr | This feature was designed by Kenta Nakai, based on a regular expression from PROSITE for predicting N-myristoylated/palmitylated proteins.
The score is zero unless the sequence starts with residues matching M?G[^EDRKHPFYW]..[STAGCN][^P] at the N-terminus. ("[^EDRKHPFYW]" denotes any residue except glutamate, aspartate, arginine, etc.) For proteins matching that pattern at their N-terminus, the score is one, plus one extra if the N-terminus is "MGC" or "GC", and plus one extra if the sequence has at least 2 lysine residues between its 4th and 10th residues (inclusive on both ends). Note that although this definition currently allows for the possibility that the initial methionine may be missing, we recommend that the initial methionine always be included in query sequences. 99% of proteins have a zero value for this feature. There may be a slight enrichment of Golgi proteins amongst those with non-zero values, but proteins with a variety of localization sites are found as well. | ||||||||||||||||||||||||||||||||||||||||||||||||
dna | The number of matches to 63 PROSITE regular expression motifs for DNA binding (exact regular expressions in plain text). For a given regular expression, the number of matches is the number of non-overlapping matches anywhere in the amino acid sequence. The value of "dna" is the sum of this count over the collection of 63 regular expressions.
Although the relationship is not generally causal (in some cases DNA-binding regions do double duty as nuclear localization signals), this feature correlates strongly with nuclear localization. In the current (2007/3) dataset, the statistical performance of a non-zero value of this feature as a nuclear protein detector is:
Despite the presence of DNA in mitochondria and chloroplasts, non-zero values do not appear to be enriched in mitochondrial or chloroplastic proteins. | ||||||||||||||||||||||||||||||||||||||||||||||||
rib | The number of matches to 71 PROSITE regular expression motifs for ribosomal proteins (exact regular expressions in plain text).
For a given regular expression, the number of matches is the number of non-overlapping matches
anywhere in the amino acid sequence. The value of "rib" is the sum of this count over the collection of
71 regular expressions.
A non-zero value for this feature seems to be a fairly specific (but not very sensitive) detector of ribosomal proteins, including those of mitochondrial, chloroplastic and cytoplasmic ribosomes. | ||||||||||||||||||||||||||||||||||||||||||||||||
bac | The number of matches to 33 PROSITE regular expression motifs for prokaryotic DNA binding (exact regular expressions in plain text).
For a given regular expression, the number of matches is the number of non-overlapping matches
anywhere in the amino acid sequence. The value of "bac" is the sum of this count over the collection
of 33 regular expressions.
Far fewer than 1% of eukaryotic proteins show a non-zero value for this feature, and those that do are not enriched for the nucleus. | ||||||||||||||||||||||||||||||||||||||||||||||||
Membrane Topology Features | Several binary features reflect the predicted membrane topology of proteins (including
non-membrane proteins). Exactly one has a value of "1" and the rest "0".
The prediction method combines the values of
four other PSORT II features: psg, gvh, alm,
and top.
The exact computation is difficult to describe
succinctly (the only way to really understand it is to download the standalone package and study the
Perl code in the file "sub1.pl").
The following table describes each of the membrane topology features:
As of 2007/3, in the animal dataset 97% of all plasma membrane proteins are predicted to be of type "m3a" or "m3b", and conversely 90% of all proteins predicted to be "m3a" or "m3b" are plasma membrane proteins, i.e. as a detector for plasma membrane proteins "m3a or m3b" has a sensitivity of 97% and a specificity of 90%. These numbers drop to (95%, 62%) for fungi and (80%, 55%) for the plant dataset. Most of this difference is most likely simply an artifact of biases in our datasets, but it may suggest that tuning the predictors for plant proteins could improve prediction performance. |
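
Several of the features in the table above (erl, px2, vac, rnp, myr, tyr, leu) are defined precisely enough to sketch in a few lines. The Python below is only an illustration written from the descriptions on this page; it is not the PSORT II Perl code, and all function names are hypothetical. It rests on several assumptions: the "x" in the px2 pattern is read as "any residue"; matches are counted without overlap (this page states that rule explicitly only for dna, rib and bac); residue positions for myr are counted from the first residue of the query as given; and for tyr and leu the number of predicted membrane-spanning regions and the predicted cytoplasmic tail are supplied by the caller rather than computed.

```python
import re

# Patterns transcribed from the table; "x" in the px2 pattern is assumed to mean any residue.
PX2 = re.compile(r"[RK][LI].{5}[HQ]L")
VAC = re.compile(r"[TIK]LP[NKI]")
RNP = re.compile(r"[RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]")
MYR = re.compile(r"M?G[^EDRKHPFYW]..[STAGCN][^P]")


def count_matches(pattern, seq):
    """Number of non-overlapping matches anywhere in seq (assumed counting rule)."""
    return len(pattern.findall(seq))


def erl(seq):
    """1 if the last four C-terminal residues are KDEL or HDEL, otherwise 0."""
    return 1 if seq.endswith(("KDEL", "HDEL")) else 0


def myr(seq):
    """0 unless the N-terminus matches the pattern; otherwise 1 plus up to two bonuses."""
    if MYR.match(seq) is None:  # match() anchors the pattern at the N-terminus
        return 0
    score = 1
    if seq.startswith(("MGC", "GC")):
        score += 1
    # Residues 4..10 (1-based, inclusive), counted from the start of the query (assumption).
    if seq[3:10].count("K") >= 2:
        score += 1
    return score


def tyr(n_tms, tail):
    """10 x fraction of tyrosines in the tail; 0 unless one TM region and tail <= 50."""
    # Returning 0 for an empty tail is an assumption; the page does not cover that case.
    if n_tms != 1 or not tail or len(tail) > 50:
        return 0.0
    return 10.0 * tail.count("Y") / len(tail)


def leu(n_tms, tail):
    """10 x number of non-overlapping "LL" in the tail; 0 unless exactly one TM region."""
    if n_tms != 1:
        return 0
    return 10 * tail.count("LL")  # str.count counts non-overlapping occurrences
```

For example, `count_matches(VAC, seq)` gives a vac-style count for a sequence string `seq`, and `erl(seq)` gives the erl flag. The features that depend on tables or code not reproduced on this page (pox, caa, yqr, act, and the membrane topology features) are deliberately left out.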