PSORT II Localization Features

Last Updated $Date: 2008/11/06 02:15:41 $

This page describes the PSORT II variables that are potentially used by WoLF PSORT. However, since WoLF PSORT uses a feature selection process to discard features that are not clearly useful for prediction, not all of the feature variables described here are actually used for WoLF PSORT prediction. You may still obtain information about the values of these unused features by following the links to the PSORT II verbose output.

Feature Index:
a-l:   act, alm, bac, caa, dna, erl, erm, gpi, gvh, leu
 m :   m1a, m1b, m2, mNt, m3a, m3b, m_, mip, mit, myr
n-z:   nuc, pox, px2, psg, rib, rnp, tms, tyr, top, vac, yqr

Name  Description
psg Score for the presence of signal peptides; the algorithm was basically developed by McGeoch, "On the predictive recognition of signal peptide sequences", Virus Res., 3(3):271-286, 1985. This feature was further expanded by Nakai and Kanehisa in 1992, and published in "Refinement of the prediction methods of signal peptides for the genome analyses of Saccharomyces cerevisiae and Bacillus subtilis", Genome Informatics, 1996.
gvh Original weight-matrix score of von Heijne, "A new method for predicting signal sequence cleavage sites", Nucleic Acids Res., 14:4683-90, 1986. Used for the recognition of cleavage sites of eukaryotic signal peptides. (For a historical technical reason that is no longer relevant, the value actually computed and displayed has 7.5 subtracted from it.)
alm Discriminant score, calculated from the most hydrophobic 17 residue segment, given by the algorithm of Klein, Kanehisa, DeLisi, "The detection and classification of membrane-spanning proteins", Biochim Biophys Acta., 815(3):468-76, 1985.
top Score for predicting the topology of membrane proteins, by Hartmann, Rapoport, Lodish's method, "Predicting the orientation of eukaryotic membrane-spanning proteins", Proc Natl Acad Sci USA., 86(15):5786-90, 1989. It is essentially the net charge difference of 15 residues flanking the most N-terminal transmembrane segment on both sides.
tms Predicted number of transmembrane segments (except the cleavable signal peptide), given by Klein, Kanehisa, DeLisi's algorithm, "The detection and classification of membrane-spanning proteins", Biochim Biophys Acta., 815(3):468-76, 1985.
mit Score for the presence of an N-terminal mitochondrial targeting signal, which is calculated from the amino acid composition of the N-terminal 20 residues. Developed by Nakai and Kanehisa, "A knowledge base for predicting protein localization sites in eukaryotic cells", Genomics, 14:897-911, 1992.
mip Predicted position of the cleavage sites of mitochondrial targeting signals by Gavel and von Heijne's method, "Cleavage-site motifs in mitochondrial targeting peptides.", Protein Eng., 4(1):33-7, 1990.
nuc Discriminant score for being a nuclear protein, calculated from the presence of NLS motif, bipartite motif, and the amino acid composition, by Nakai and Kanehisa. "A knowledge base for predicting protein localization sites in eukaryotic cells", Genomics, 14:897-911, 1992.

More specifically, this feature is a meta-feature consisting of features described below. Several of the features check for the presence of either one of the basic residues 'R' or 'K'. Although not a standard notation, for convenience I use 'B' here to denote a residue which is either 'R' or 'K'.
Name  Definition
localPat4 For a window of 4 residues:
Window Contents       Score
all Bs                5
three Bs and one P    4
three Bs and one H    3
otherwise             0
pat4 Sum of localPat4 over all (including overlapping) length four windows in the sequence.
localPat7 For a window of seven residues, the score is determined by the first condition that holds from this list:
Condition                                                                                                                    Score
The first residue is not a proline                                                                                           0
The window contains a run of four contiguous B residues                                                                      5
The window contains four contiguous residues containing three B residues, separated by d residues from the initial proline   5 - d
pat7 Sum of localPat7 over all (including overlapping) length seven windows in the sequence.
localBipartite For a window of 17 residues, the score is determined by the first condition that holds from this list:
Condition                                                 Score
The first two residues are not BB                         0
The number of B residues in the final 5 residues is < 2   0
otherwise                                                 the number of B residues in the final 5 residues
bipartite Sum of localBipartite over all (including overlapping) length 17 windows in the sequence.
nucaa Let fracB be the fraction of B residues throughout the entire sequence.
Condition     Score
fracB < 0.2   0
otherwise     fracB * 10 - 1
These features are currently summarized into the "nuc" feature by the following formula:

nuc = 0.0627 * pat4 + 0.0862 * pat7 + 0.0822 * bipartite + 0.9497 * nucaa - 0.4738
The coefficients were derived by Kenta Nakai circa 1995 using linear discriminant analysis on a small dataset. The aggregation of various features was done for historical reasons. In the future we will probably use (and give more mnemonic names to) 'pat4', 'pat7', and 'bipartite' as independent features exposed to the WoLF PSORT selection and weighting process. The 'nucaa' feature is largely redundant since the WoLF PSORT candidate feature list includes the overall composition of each amino acid.
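
The sub-feature definitions and the combining formula above can be sketched in Python as follows. This is a minimal illustration, not the PSORT II code: the function names are invented here, the nucaa threshold is read as "fracB < 0.2 scores 0", the last weighted term of the formula is assumed to be the nucaa sub-feature, d is taken as the number of residues between the initial proline and the four-residue block, and a localPat7 window matching none of the listed conditions is assumed to score 0.

```python
BASIC = set("RK")  # the 'B' residues of the text


def local_pat4(window: str) -> int:
    """Score one 4-residue window."""
    n_basic = sum(r in BASIC for r in window)
    if n_basic == 4:
        return 5
    if n_basic == 3 and "P" in window:
        return 4
    if n_basic == 3 and "H" in window:
        return 3
    return 0


def local_pat7(window: str) -> int:
    """Score one 7-residue window; the first condition that holds wins."""
    if window[0] != "P":
        return 0
    basic = [r in BASIC for r in window]
    # A run of four contiguous B residues anywhere after the proline.
    if any(all(basic[i:i + 4]) for i in range(1, 4)):
        return 5
    # Assumption: d = residues between the proline and a 4-residue block
    # holding three B residues; the smallest such d is used.
    for d in range(3):
        if sum(basic[1 + d:5 + d]) == 3:
            return 5 - d
    return 0  # assumption: no listed condition holds


def local_bipartite(window: str) -> int:
    """Score one 17-residue window."""
    if not (window[0] in BASIC and window[1] in BASIC):
        return 0
    n_final = sum(r in BASIC for r in window[-5:])
    return n_final if n_final >= 2 else 0


def window_sum(seq: str, width: int, score) -> int:
    """Sum a local score over all (overlapping) windows of the given width."""
    return sum(score(seq[i:i + width]) for i in range(len(seq) - width + 1))


def nucaa(seq: str) -> float:
    frac_b = sum(r in BASIC for r in seq) / len(seq)
    return 0.0 if frac_b < 0.2 else frac_b * 10 - 1


def nuc(seq: str) -> float:
    return (0.0627 * window_sum(seq, 4, local_pat4)
            + 0.0862 * window_sum(seq, 7, local_pat7)
            + 0.0822 * window_sum(seq, 17, local_bipartite)
            + 0.9497 * nucaa(seq)
            - 0.4738)
```

Under these assumptions a sequence with no basic residues at all receives the constant term alone, i.e. nuc = -0.4738.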

Using this feature alone, with a threshold of (nuc > 1), one obtains a (sensitivity, specificity) of (30%, 71%), (19%, 65%), and (21%, 77%) for animal, plant, and fungi respectively. Graphical Summary.

erl "1" if the last four C-terminal residues are "KDEL" or "HDEL", 0 otherwise. As an E.R. protein detector, this feature has modest sensitivity (9-25%) but excellent specificity. In our current (2007/3) dataset, all of the proteins with a score of "1" are E.R. proteins.
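
As a small illustration, the erl definition amounts to a C-terminal suffix test. The function name below is chosen for this sketch only.

```python
def erl(seq: str) -> int:
    """Return 1 if the sequence ends in KDEL or HDEL, otherwise 0."""
    return 1 if seq.endswith(("KDEL", "HDEL")) else 0
```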
erm Score originally intended to detect E.R. membrane proteins, calculated from the presence of the retention signals and the membrane topology. PSORT II original feature, Nakai & Horton, "PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization", Trends in Biochemical Sciences, 24(1):34-5, 1999.

The actual definition of "erm"'s value is slightly complicated. It is the sum of two quantities, which we will call N-score and C-score here. Let the amino acid sequence be denoted as S1, S2, ... Sn.

The N-score is the number of arginine residues in the first four residues after the most N-terminal residue (i.e. S2S3S4S5), plus a bonus of 2 if those residues contain at least one arginine and the predicted membrane type (by the method of Klein et al., see the tms feature) is type "2" or type "Nt". The C-score is the number of lysine residues in the last four residues up to but not including the most C-terminal residue (i.e. Sn-4Sn-3Sn-2Sn-1), plus a bonus of 2 if those residues contain at least one lysine and the predicted membrane type (by the method of Klein et al., see the tms feature) is type "1a" or type "1b".
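
The two quantities can be sketched as follows. This sketch takes the arginine count from S2..S5 and the lysine count from Sn-4..Sn-1, and it treats the predicted membrane type as an input string ("1a", "1b", "2", "Nt", ...) supplied by the caller. That parameter and the function names are assumptions made for illustration; the topology prediction itself is not reproduced here.

```python
def n_score(seq: str, membrane_type: str) -> int:
    """Arginines in S2..S5, plus 2 if any are present and the type is 2 or Nt."""
    n_arg = seq[1:5].count("R")
    bonus = 2 if n_arg > 0 and membrane_type in ("2", "Nt") else 0
    return n_arg + bonus


def c_score(seq: str, membrane_type: str) -> int:
    """Lysines in Sn-4..Sn-1, plus 2 if any are present and the type is 1a or 1b."""
    n_lys = seq[-5:-1].count("K")
    bonus = 2 if n_lys > 0 and membrane_type in ("1a", "1b") else 0
    return n_lys + bonus


def erm(seq: str, membrane_type: str) -> int:
    """Sum of the two quantities."""
    return n_score(seq, membrane_type) + c_score(seq, membrane_type)
```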

Unfortunately, as defined, this feature does not appear effective for detecting E.R. membrane proteins. Slightly more than half of all proteins have a value of "0" for this feature. Upon visual inspection, the most apparent statistical trend is that 80-90% of cytoskeletal proteins have a value of "1", versus an average of 29-32% for all proteins.

pox Developed by Kenta Nakai, a non-zero value for this feature indicates a possible PTS1 C-terminal peroxisomal sorting signal. The value depends on the most specific regular expression given in the table below which matches the 3 C-terminal residues.
Regular Expression     Score
SKL                    5/6
SKF                    1/2
[SAGCN][RKH][LIVMAF]   1/4
no match               0
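
The table lookup above can be sketched as follows, trying the patterns from most to least specific against the last three residues. The scores are read as the fractions 5/6, 1/2 and 1/4; the names are chosen for this sketch only.

```python
import re
from fractions import Fraction

# Ordered from most specific to least specific, as in the table.
POX_RULES = [
    ("SKL", Fraction(5, 6)),
    ("SKF", Fraction(1, 2)),
    ("[SAGCN][RKH][LIVMAF]", Fraction(1, 4)),
]


def pox(seq: str) -> Fraction:
    """Score of the first (most specific) pattern matching the 3 C-terminal residues."""
    tail = seq[-3:]
    for pattern, score in POX_RULES:
        if re.fullmatch(pattern, tail):
            return score
    return Fraction(0)
```

Since "SKL" and "SKF" also match the third, more general pattern, the order of the rules is what makes the most specific match win.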
px2 Number of matches to the regular expression "[RK][LI]xxxxx[HQ]L" (where "x" denotes any residue) in any part of the sequence. This is intended to be a detector for PTS2 peroxisomal targeting signals (for a review see McNew & Goodman, "The targeting and assembly of peroxisomal proteins: some old rules do not apply.", Trends Biochem Sci., 21(2):54-8, 1996). There seems to be a weak enrichment for this pattern in plant peroxisomal proteins.
vac Number of matches to the regular expression "[TIK]LP[NKI]" in any part of the sequence. Intended to detect a possible vacuolar targeting signal, but does not show strong correlation with vacuolar localization in our current dataset (2007/03).
rnp Number of matches to the regular expression "[RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]" in any part of the sequence. This is an RNA-binding protein motif taken from PROSITE. It seems to detect a few chloroplast RNA binding proteins.
act Number of matches to either of the following two regular expressions:
[EQ]..[ATV]F..W.N
[LIVM].[SGN][LIVM][DAGHE][SAG].[DEAG][LIVM].[DEAG]....[LIVM].L[SAG][LIVM][LIVM]W.[LIVM][LIVM]
in any part of the sequence. These patterns were taken from PROSITE in the hope of detecting some cytoskeletal proteins, but only a few proteins (e.g. plastin-2) match these regexps.
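
The match-counting features px2, vac, rnp and act can be sketched with Python's re module as below. The page does not say whether overlapping matches are counted for these features; this sketch counts non-overlapping matches, which is what re.findall returns, and that choice is an assumption. The "x" positions of the px2 pattern are written as ".{5}". The dictionary and function names are invented for illustration.

```python
import re

PATTERNS = {
    "px2": [r"[RK][LI].{5}[HQ]L"],
    "vac": [r"[TIK]LP[NKI]"],
    "rnp": [r"[RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]"],
    "act": [
        r"[EQ]..[ATV]F..W.N",
        r"[LIVM].[SGN][LIVM][DAGHE][SAG].[DEAG][LIVM].[DEAG]....[LIVM].L"
        r"[SAG][LIVM][LIVM]W.[LIVM][LIVM]",
    ],
}


def count_matches(seq: str, feature: str) -> int:
    """Total number of non-overlapping matches over the feature's patterns."""
    return sum(len(re.findall(p, seq)) for p in PATTERNS[feature])
```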
caa Presence of the CaaX motif or related motifs on the C-terminus, for predicting isoprenylated proteins as suggested in a review by Zhang & Casey, Annu Rev Biochem., 65:241-69, 1996. The value of this feature is given by the first regular expression in the following table which matches the C-terminus.
Regular Expression   Score
C[^DENQ][LIVM].$     2
C.C$                 1
CC..$                1
no match             0
In practice only a few proteins match these patterns and the ones that do match localize to a variety of sites.
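
A minimal sketch of this first-match lookup, with names chosen for illustration:

```python
import re

# Tried in table order; "$" anchors each pattern at the C-terminus.
CAA_RULES = [
    (r"C[^DENQ][LIVM].$", 2),
    (r"C.C$", 1),
    (r"CC..$", 1),
]


def caa(seq: str) -> int:
    """Score of the first pattern in the table that matches the C-terminus."""
    for pattern, score in CAA_RULES:
        if re.search(pattern, seq):
            return score
    return 0
```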
yqr This feature is motivated by the tyrosine based trans-Golgi localization signal for membrane proteins described by Humphrey et al., "Localization of TGN38 to the trans-Golgi network: involvement of a cytoplasmic tyrosine-containing sequence", The Journal of Cell Biology, 120:1123-35, 1993. The score of this feature depends on the number of times "YQRL" appears anywhere in the sequence and the number of predicted transmembrane spanning regions (i.e. the feature tms).

The exact value is given in this table:
# predicted transmembrane spanning regions   Score
0                                            0
1                                            3 + # occurrences of "YQRL" in sequence
> 1                                          # occurrences of "YQRL" in sequence
Although "YQRL" has been experimentally verified as a genuine trans-Golgi localization signal, as currently defined this feature does not effectively discriminate the Golgi from other sites in the secretory pathway.
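
Read literally, the table above corresponds to the following sketch. The tms argument stands for the value of the tms feature and is supplied by the caller; the function name is invented here. Note that, as the table is written, a protein with exactly one predicted transmembrane region scores 3 even when "YQRL" does not occur.

```python
def yqr(seq: str, tms: int) -> int:
    """Score from the number of "YQRL" occurrences and the tms value."""
    n_yqrl = seq.count("YQRL")
    if tms == 0:
        return 0
    if tms == 1:
        return 3 + n_yqrl
    return n_yqrl
```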

tyr This feature, designed by Kenta Nakai, is motivated by the fact that the presence of a tyrosine-containing motif in the cytoplasmic tail of membrane proteins can be important for selective inclusion in clathrin-coated vesicles (endocytosis) and lysosomal targeting.

The value of this feature is zero, except for proteins that are predicted by the tms feature to have exactly one membrane spanning region and whose predicted cytoplasmic tail (i.e. the part of the sequence after the membrane spanning region) is no more than 50 residues long. In that case the value is 10 times the fraction of tyrosines in the predicted cytoplasmic tail. Unfortunately, most proteins have a value of zero, and the proteins with non-zero values localize to a variety of sites.

leu This feature, designed by Kenta Nakai, is motivated by the fact that the presence of a dileucine motif "LL" in the cytoplasmic tail of membrane proteins can be important for selective inclusion in clathrin-coated vesicles (endocytosis) and lysosomal targeting.

The value of this feature is zero, except for proteins that are predicted to have exactly one membrane spanning region by the tms feature. In that case the value is 10 times the number of (non-overlapping) occurrences of "LL" in the predicted cytoplasmic tail (i.e. the part of the sequence after the membrane spanning region).
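
The tyr and leu definitions can be sketched as follows. Both functions take the predicted cytoplasmic tail and the tms value as inputs, since the membrane prediction itself is outside this sketch; the names and the handling of an empty tail (scored as zero) are assumptions.

```python
def tyr(tail: str, tms: int) -> float:
    """10 times the tyrosine fraction of a tail of at most 50 residues."""
    if tms != 1 or not tail or len(tail) > 50:
        return 0.0
    return 10 * tail.count("Y") / len(tail)


def leu(tail: str, tms: int) -> int:
    """10 times the number of non-overlapping "LL" occurrences in the tail."""
    if tms != 1:
        return 0
    return 10 * tail.count("LL")  # str.count counts non-overlapping matches
```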

Statistically speaking all but 6-7% of all proteins have a value of "0", but the percentage of Golgi, E.R. and peroxisomal proteins with non-zero values tends to be somewhat higher:
Percentage of proteins with non-zero "leu" values
Organism type   Golgi   E.R.   Peroxisome   All Proteins
animal          14      18     40           7
plant           24      20     8            6
fungi           11      14     13           7

gpi The feature was designed by Kenta Nakai, with the intention of predicting GPI-anchored proteins based on the empirical knowledge that most of them are type Ia membrane proteins with a very short cytoplasmic tail (within 10 residues) (Nakai and Kanehisa, "A knowledge base for predicting protein localization sites in eukaryotic cells", Genomics, 14:897-911, 1992).

The value of gpi is "1" if it is predicted to be a type Ia membrane protein with a cytoplasmic tail of less than 10 residues, otherwise "0".

Unfortunately only 2 out of more than 10,000 proteins in our current (2007/3) dataset have a non-zero value for this feature.

myr The feature was designed by Kenta Nakai, based on a regular expression from PROSITE for predicting N-myristoylated/palmitylated proteins.

The score is zero unless the sequence starts with residues matching M?G[^EDRKHPFYW]..[STAGCN][^P] at the N-terminus. ("[^EDRKHPFYW]" denotes any residue except glutamic acid, aspartic acid, arginine, etc.) For proteins matching that pattern on their N-terminus, the score is one plus: one extra if the N-terminus is "MGC" or "GC", and one extra if the sequence has at least 2 lysine residues between its 4th and 10th residues (inclusive on both ends). Note that although this definition currently allows for the possibility that the initial methionine may be missing, we recommend that the initial methionine always be included in query sequences.
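
A minimal sketch of this scoring rule is given below. It assumes that the lysine window covers residues 4 through 10, counted from the first residue of the query sequence as given; the function name is chosen for illustration.

```python
import re

MYR_PATTERN = re.compile(r"M?G[^EDRKHPFYW]..[STAGCN][^P]")


def myr(seq: str) -> int:
    """0 if the N-terminal pattern is absent, otherwise 1 plus up to two bonuses."""
    if not MYR_PATTERN.match(seq):  # match() anchors at the N-terminus
        return 0
    score = 1
    if seq.startswith(("MGC", "GC")):
        score += 1
    # Assumption: residues 4..10 (1-based, inclusive) are seq[3:10].
    if seq[3:10].count("K") >= 2:
        score += 1
    return score
```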

99% of proteins have a zero value for this feature. There may be a slight enrichment of Golgi proteins amongst those with non-zero values, but proteins with a variety of localization sites are found as well.

dna The number of matches to 63 PROSITE DNA binding motif regular expressions (exact regular expressions in plain text). For a given regular expression the number of matches is the number of non-overlapping matches anywhere in the amino acid sequence. The value of "dna" is the sum of this over the collection of 63 regular expressions.

Although not generally causal (in some cases DNA binding regions do double duty as nuclear localization signals), this feature correlates strongly with nuclear localization. In the current (2007/3) dataset, the statistical performance of a non-zero value of this feature as a nuclear protein detector is:
non-zero value of "dna" as a nuclear protein detector
Organism Type   Sensitivity             Specificity
                Pr[ non-zero | nucl ]   Pr[ nucl | non-zero ]
Animal          52%                     95%
Plant           26%                     93%
Fungi           25%                     96%
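
The two columns of the table are conditional frequencies, and the "specificity" used here, Pr[ nucl | non-zero ], is the quantity often called precision elsewhere. The sketch below shows how both could be computed from per-protein "dna" values and nuclear labels. The function name is invented, and it assumes at least one nuclear protein and at least one non-zero value.

```python
def detector_stats(dna_values, is_nuclear):
    """Return (Pr[non-zero | nucl], Pr[nucl | non-zero]) for a non-zero "dna" detector."""
    flagged = [v != 0 for v in dna_values]
    n_nuclear = sum(is_nuclear)
    n_flagged = sum(flagged)
    n_both = sum(f and n for f, n in zip(flagged, is_nuclear))
    sensitivity = n_both / n_nuclear
    specificity = n_both / n_flagged
    return sensitivity, specificity
```

For example, with made-up values [2, 0, 1, 0, 0, 3, 0, 0] where only the first four proteins are nuclear, two of the four nuclear proteins are flagged and two of the three flagged proteins are nuclear.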

Despite the presence of DNA, non-zero values do not appear to be enriched in mitochondrial or chloroplastic proteins.

rib The number of matches to 71 PROSITE regular expression motifs for ribosomal proteins (exact regular expressions in plain text). For a given regular expression the number of matches is the number of non-overlapping matches anywhere in the amino acid sequence. The value of "rib" is the sum of this over the collection of 71 regular expressions.

A non-zero value for this feature seems to be a fairly specific (but not very sensitive) detector of ribosomal proteins, including those of mitochondrial, chloroplastic and cytoplasmic ribosomes.

bac The number of matches to 33 PROSITE regular expression motifs for prokaryotic DNA binding motifs (exact regular expressions in plain text). For a given regular expression the number of matches is the number of non-overlapping matches anywhere in the amino acid sequence. The value of "bac" is the sum of this over the collection of 33 regular expressions.

Far less than 1% of eukaryotic proteins show a non-zero value for this feature and those that do are not enriched for the nucleus.

Membrane Topology Features
Several binary features reflect the predicted membrane topology of proteins (including non-membrane proteins). Exactly one has a value of "1" and the rest "0". The prediction method combines the values of four other PSORT II features: psg, gvh, alm, and top. The exact computation is difficult to describe succinctly (the only way to really understand it is to download the standalone package and study the Perl code in the file "sub1.pl").

The following table describes each of the membrane topology features:
Definition of PSORT II membrane topology features
Name   Definition and Note
m1a    Definition: Membrane protein with the type 1a topology (having a cleavable signal sequence and one transmembrane segment).
       Note: Very rare (< 1%) in all organism types.
m1b    Definition: Membrane protein with the type 1b topology.
       Note: Seems to be slightly enriched for E.R. proteins, some of which are not membrane proteins (e.g., calreticulin); also somewhat enriched for secreted proteins.
m2     Definition: Membrane protein with the type 2 topology.
       Note: Substantially enriched for sites on the secretory pathway (including extracellular).
mNt    Definition: Membrane protein with the N-tail topology (having an uncleavable signal peptide and one transmembrane segment near its C-terminus).
       Note: Enriched for peroxisomal proteins in the animal dataset (2007/03). Slightly enriched for sites on the secretory pathway for each eukaryotic organism type.
m3a    Definition: Membrane protein with the type 3a topology (multiple transmembrane regions with its N-terminus facing the cytosolic side).
       Note: Strongly enriched for plasma membrane proteins (includes 52-70% of all plasma membrane proteins but only 7-20% of all proteins, by 2007/3 dataset). Also substantially enriched for membrane proteins in other secretory pathway sites.
m3b    Definition: Membrane protein with the type 3b topology (multiple transmembrane regions with its N-terminus facing the extra-cytosolic side).
       Note: With 3-7% coverage, this type is less common than "m3a", but also highly enriched for plasma membrane proteins and substantially enriched for membrane proteins in other secretory pathway sites.
m_     Definition: Not an integral membrane protein.
       Note: In our current datasets (2007/03) 53-78% of proteins are predicted to be soluble proteins. As expected, they are highly enriched for what appear to indeed be soluble proteins. Since this feature does not involve membrane topology prediction, it depends only on a prediction of zero membrane spanning regions by a slightly modified version of the alm method of Klein et al.

As of 2007/3, in the animal dataset 97% of all plasma membrane proteins are predicted to be of type "m3a" or "m3b", and conversely 90% of all proteins predicted to be "m3a" or "m3b" are plasma membrane proteins; i.e. as a detector for plasma membrane proteins, "m3a or m3b" has a sensitivity of 97% and a specificity of 90%. These numbers drop to (95%, 62%) for the fungi dataset and (80%, 55%) for the plant dataset. Most of this difference is most likely simply an artifact of biases in our datasets, but it may suggest that tuning the predictors for plant proteins could improve prediction performance.