Abstract
With the rapid development of high-throughput technologies, systems biology now has a great opportunity, made possible by the growing volume of data available online. Biological data analytics is regarded as a critical means of better understanding such data by extracting latent features, relationships and the associated mechanisms. It is therefore important to evaluate how data analytics can be applied in practice from both computational and biological perspectives. This paper investigates interaction relationships in proteomics, which provide insight into the critical molecular processes underlying infection mechanisms. Specifically, we focus on host–pathogen protein–protein interactions (HP-PPI), which represent primary challenges in infectious diseases and drug design. Accordingly, we detail a novel framework based on data analytics and machine learning techniques for analyzing these interactions, and describe the analytical results it yields on HP-PPI. Using this framework, which serves as a pipeline solution for extracting and learning from raw proteomics data, we first evaluate several models from the literature using different analytic technologies and performance measurements. We then propose an unsupervised deep learning model based on stacked denoising autoencoders to capture higher-level features of the sequence information. The achieved performance indicates that the unsupervised deep learning model handles the host–pathogen protein interaction scenario better than all of the other models evaluated. The results will further help to enrich the theoretical and technical foundation for analyzing HP-PPI networks.
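To make the idea of a stacked denoising autoencoder concrete, the following is a minimal NumPy sketch of greedy layer-wise pretraining. It is not the authors' implementation: the class and function names (`DenoisingAutoencoder`, `pretrain_stack`, `encode_stack`), the tied weights, the masking noise, the layer sizes, the learning rate and the random stand-in data are all assumptions made for illustration. In the setting described above, the rows of `X` would be sequence-derived feature vectors for protein pairs.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class DenoisingAutoencoder:
    """One denoising autoencoder layer with tied weights (illustrative only)."""

    def __init__(self, n_in, n_hidden, corruption=0.3, lr=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(0.0, 0.1, size=(n_in, n_hidden))
        self.b_h = np.zeros(n_hidden)   # encoder bias
        self.b_v = np.zeros(n_in)       # decoder bias
        self.corruption = corruption    # fraction of inputs zeroed out
        self.lr = lr

    def encode(self, X):
        return sigmoid(X @ self.W + self.b_h)

    def decode(self, H):
        return sigmoid(H @ self.W.T + self.b_v)

    def loss(self, X):
        # Mean squared reconstruction error on the clean input.
        return float(np.mean((self.decode(self.encode(X)) - X) ** 2))

    def fit(self, X, epochs=300):
        n = X.shape[0]
        for _ in range(epochs):
            # Corrupt the input with masking noise, then reconstruct the clean X.
            mask = self.rng.random(X.shape) > self.corruption
            X_noisy = X * mask
            H = self.encode(X_noisy)
            R = self.decode(H)
            # Backpropagate the squared error through decoder and encoder.
            d_R = (R - X) * R * (1.0 - R)
            d_H = (d_R @ self.W) * H * (1.0 - H)
            # Tied weights: W receives gradient from both encoder and decoder.
            grad_W = X_noisy.T @ d_H + d_R.T @ H
            self.W -= self.lr * grad_W / n
            self.b_h -= self.lr * d_H.mean(axis=0)
            self.b_v -= self.lr * d_R.mean(axis=0)
        return self


def pretrain_stack(X, hidden_sizes, epochs=300):
    """Greedy layer-wise pretraining: each layer is trained on the codes of the previous one."""
    layers, inputs = [], X
    for i, n_hidden in enumerate(hidden_sizes):
        layer = DenoisingAutoencoder(inputs.shape[1], n_hidden, seed=i)
        layer.fit(inputs, epochs=epochs)
        layers.append(layer)
        inputs = layer.encode(inputs)
    return layers


def encode_stack(layers, X):
    """Map inputs to the top-level (higher-level) feature representation."""
    for layer in layers:
        X = layer.encode(X)
    return X


# Random stand-in for sequence-derived features (assumption, not real data).
X = np.random.default_rng(1).random((64, 20)) ** 3
stack = pretrain_stack(X, hidden_sizes=[8, 4])
features = encode_stack(stack, X)
print(features.shape)
```

In a pipeline of the kind the abstract outlines, the codes returned by `encode_stack` would then be passed to a downstream classifier; which classifier and which features to use are choices this sketch leaves open.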
Acknowledgements
This work was supported by a scholarship from the China Scholarship Council (CSC) while the first author was pursuing his PhD degree at the University of Wollongong, Australia. The first and second authors were also supported by UGPN RCF 2019 to visit the University of Surrey to strengthen the algorithmic part of this work. We wish to extend our deepest gratitude to Prof. Yaochu Jin for his valuable suggestions and support.
Cite this article
Chen, H., Shen, J., Wang, L. et al. A framework towards data analytics on host–pathogen protein–protein interactions. J Ambient Intell Human Comput 11, 4667–4679 (2020). https://doi.org/10.1007/s12652-020-01715-7
DOI: https://doi.org/10.1007/s12652-020-01715-7