Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models

Cornelia Caragea¹, Doina Caragea, Adrian Silvescu, Vasant Honavar

PMID: 21034431
PMCID: PMC2966293
DOI: 10.1186/1471-2105-11-S8-S6

Cornelia Caragea et al. BMC Bioinformatics. 2010 Oct 26;11 Suppl 8(Suppl 8):S6. doi: 10.1186/1471-2105-11-S8-S6.

Affiliation

¹ Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University, Ames, IA 50010, USA. cornelia@cs.iastate.edu

Abstract

Background: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data.

Results: In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) an expectation maximization (EM) based approach to semi-supervised training of MMs; and (iii) a co-training based approach to semi-supervised training of MMs. Approaches (ii) and (iii) both make use of unlabeled data.

Conclusions: The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.
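The Markov model (MM) baseline named in the abstract can be illustrated with a minimal sketch. It assumes one k-th order Markov model per localization class, Laplace smoothing, and a 20-letter amino acid alphabet. The class name `MarkovModelClassifier`, the `pseudocount` parameter, and the toy sequences are hypothetical and are not taken from the paper. The sketch also ignores the distribution of the first k residues, which is a simplification.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


class MarkovModelClassifier:
    """Illustrative k-th order Markov model classifier (one model per class)."""

    def __init__(self, k=1, alphabet=AMINO_ACIDS, pseudocount=1.0):
        self.k = k
        self.alphabet = alphabet
        self.pseudocount = pseudocount
        self.counts = {}                      # class -> context -> letter -> count
        self.class_counts = defaultdict(int)  # class -> number of training sequences

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            self.class_counts[label] += 1
            table = self.counts.setdefault(
                label, defaultdict(lambda: defaultdict(float))
            )
            # Count how often each letter follows each length-k context.
            for i in range(self.k, len(seq)):
                table[seq[i - self.k:i]][seq[i]] += 1.0
        return self

    def log_score(self, seq, label):
        # log P(class) + sum_i log P(x_i | x_{i-k}, ..., x_{i-1}, class)
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        table = self.counts[label]
        for i in range(self.k, len(seq)):
            context = table.get(seq[i - self.k:i], {})
            numerator = context.get(seq[i], 0.0) + self.pseudocount
            denominator = sum(context.values()) + self.pseudocount * len(self.alphabet)
            score += math.log(numerator / denominator)
        return score

    def predict(self, seq):
        return max(self.counts, key=lambda label: self.log_score(seq, label))


# Toy usage with made-up sequences and labels (not real localization data).
toy_model = MarkovModelClassifier(k=1).fit(
    ["AAAAAA", "AAAAAAA", "CDCDCD", "DCDCDC"],
    ["class_x", "class_x", "class_y", "class_y"],
)
toy_prediction = toy_model.predict("CDCD")
```

A model of this kind uses labeled sequences only, which is why the abstract treats it as the baseline that does not benefit from unlabeled data.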

Figures

**Figure 1**
**Comparison of AAMMs with MMs**. Comparison of AAMMs with MMs for 1% (first row), 10% (second row), and 25% (third row) of labeled data available for **non-plant** (left), **plant** (center), and **psortNeg** (right), respectively.

**Figure 2**
**Comparison of AAMM(l+u) with AAMM(l), MMs, and EM-MMs.** Comparison of AAMMs trained using an abstraction hierarchy learned from both *labeled* and unlabeled data, AAMM(l+u), with (i) AAMMs trained using an abstraction hierarchy learned only from *labeled* data, AAMM(l); (ii) Expectation-Maximization with Markov models, EM-MM; and (iii) Markov models, MM, on **non-plant** (left), **plant** (center), and **psortNeg** (right) data sets. x axis indicates the number of labeled examples in each data set corresponding to fractions of 1%, 5%, 10%, 15%, 20%, 25%, 35%, 50% of training data being treated as labeled data. The fraction of unlabeled data in each data set is fixed to 50%.

**Figure 3**
**Comparison of AAMMs with EM-MMs.** Comparison of AAMMs with EM-MMs for three different fractions of labeled data (i.e., 1%, 10%, and 25%) while varying the amount of unlabeled data on **non-plant** (left), **plant** (center), and **psortNeg** (right) data sets. x axis indicates the number of unlabeled examples in each data set corresponding to fractions of 1%, 10%, 25%, 50%, 75%, 90%, 99% of training data being treated as unlabeled data.

**Figure 4**
**Comparison of AAMMs with co-training MMs**. Comparison of AAMMs with co-training MMs on **non-plant** (left), **plant** (center), and **psortNeg** (right) data sets. AAMMs are trained on the first 60 and the last 15 amino acids of each protein sequence, AAMM(60 + 15). Co-training MMs consist of two co-trained MMs, one trained on the first 60 amino acids of each sequence, the other trained on the last 15 amino acids of each sequence. x axis indicates the number of labeled examples in each data set corresponding to fractions of 1%, 5%, 10%, 15%, 20%, 25%, 35%, 50% of training data being treated as labeled data. The fraction of unlabeled data in each data set is fixed to 50%.

**Figure 5**
**Markov model for sequence classification**. Dependency of *X_i* on *X_{i-k}*, …, *X_{i-1}* in a *k*-th order Markov model.

**Figure 6**
**Abstraction augmented Markov models**. (a) An abstraction hierarchy T on a set S = {s₁,…,s₉} of 2-grams over an alphabet of size 3. The abstractions a₁ to a₉ correspond to the 2-grams s₁ to s₉, respectively. The subset of nodes A = {a₁₅, a₆, a₁₄} represents a 3-cut γ₃ through T; (b) Dependency of *X_i* on *A_i*, which takes values in a set of abstractions A corresponding to an m-cut γ_m, in a *k*-th order AAMM.
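The idea of an m-cut through an abstraction hierarchy, as described in the Figure 6 caption, can be sketched as follows. The sketch assumes the hierarchy is a binary tree built by successive pairwise merges of the nine leaf abstractions, so that the j-th merge creates node a(9+j). The merge order in `HYPOTHETICAL_MERGES` is invented so that the 3-cut matches the caption's {a₁₅, a₆, a₁₄}; the actual tree in the figure is not recoverable from the caption, and the function name `m_cut` is likewise an assumption.

```python
def m_cut(num_leaves, merges, m):
    """Replay pairwise merges until m nodes remain.

    Leaves are named a1..a<num_leaves>; the j-th merge (1-based) creates
    node a<num_leaves + j>. Returns the cut (a set of node names) and a
    mapping from each leaf to the cut node that now represents it.
    """
    steps = num_leaves - m
    if not 1 <= m <= num_leaves or steps > len(merges):
        raise ValueError("m is out of range for the given hierarchy")

    cut = {f"a{i}" for i in range(1, num_leaves + 1)}
    leaves_under = {name: {name} for name in cut}  # node -> leaves beneath it

    for j, (left, right) in enumerate(merges[:steps], start=1):
        if left not in cut or right not in cut:
            raise ValueError(f"merge {j} refers to a node that is not in the cut")
        parent = f"a{num_leaves + j}"
        cut -= {left, right}
        cut.add(parent)
        leaves_under[parent] = leaves_under[left] | leaves_under[right]

    leaf_to_abstraction = {
        leaf: node for node in cut for leaf in leaves_under[node]
    }
    return cut, leaf_to_abstraction


# Hypothetical merge order, chosen only to reproduce the 3-cut in the caption.
HYPOTHETICAL_MERGES = [
    ("a1", "a2"),    # creates a10
    ("a3", "a4"),    # creates a11
    ("a7", "a8"),    # creates a12
    ("a10", "a11"),  # creates a13
    ("a12", "a9"),   # creates a14
    ("a13", "a5"),   # creates a15
    ("a15", "a6"),   # creates a16
    ("a16", "a14"),  # creates a17 (root)
]

three_cut, leaf_map = m_cut(9, HYPOTHETICAL_MERGES, 3)
```

Under these assumptions, `leaf_map` plays the role of the lookup from each 2-gram to its abstraction *A_i*, so an AAMM conditions *X_i* on one of m abstractions rather than on one of nine distinct 2-grams.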
