Data Fusion for cleavage target prediction
- Published
- Accepted
- Subject Areas
- Bioinformatics, Cell Biology
- Keywords
- Data Fusion, Protein Cleavage
- Copyright
- © 2016 Marini et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2016. Data Fusion for cleavage target prediction. PeerJ Preprints 4:e2172v1 https://doi.org/10.7287/peerj.preprints.2172v1
Abstract
Motivation: Protein cleavage is a pivotal process in cell metabolism. It is involved in, among other processes, cell differentiation and cell-cycle control, stress and immune response, removal of abnormally folded proteins, and cell death. Proteases (i.e. the proteins responsible for cleavage) account for ~2% of all gene products. As a consequence, misregulated proteolytic activity may result in disease. The problem of predicting cleavage targets has been addressed by a number of algorithms. Traditional prediction models tackle the problem by encoding information directly related to the outcome class (e.g. by extracting sequence patterns or frequency matrices). However, a large amount of indirectly related information is also available in public data sets. Peptidases and targets are both proteins, and knowledge bases record their similarities as well as their non-cleavage interactions; both are encoded by genes, and gene interactions are likewise recorded in databases. Our proposed Data Fusion algorithm leverages these secondary information sources to infer novel peptidase targets.
Methods: Our approach is based on matrix tri-factorization. The heterogeneous data are fused by inferring a joint model without altering their original structure, i.e. data are explicitly represented in the form of a relational block matrix R. Diagonal blocks of R are set to 0, while the other blocks are sparse matrices populated with the relations harvested from the various data sources. The elements of R are constrained to the range [0, 1], where 0 represents a negative or unknown relationship and 1 a certain relationship. We considered three object types in our matrix, namely peptidases, targets and genes. From MEROPS we obtained 657 human peptidases affecting 3460 targets and forming 8931 peptidase-target pairs. From their mapping to UniProt, 3833 genes coding for peptidases or targets were retained. This information was used to populate the peptidase-target, peptidase-gene and target-gene blocks of R. During the data fusion process, each R block is decomposed into three sub-matrices of low dimension (compared with the original R block size). There is no clear consensus on how to define these dimensions, and we proceeded by choosing the rank for a given block based on its number of known interactions. Once the dimensions are set, the three sub-matrices are used to reconstruct a user-defined target block Rt. The decomposition of R is obtained through an iterative process in which constraint matrices play an important role. Constraint matrices are populated with the associations relating objects of the same type. In our application we utilized five constraints: one gene-gene interaction matrix from BioGRID; two target-target and protease-protease interaction matrices from STRING (0.7 as combined score threshold); two target-target and protease-protease...
- - - Abstract truncated at 3,000 characters - the full version is available in the pdf file. - - -
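To make the decomposition step described in the Methods more concrete, the following is a minimal sketch of tri-factorizing a single relation block into three low-dimensional non-negative factors and reconstructing it. It is not the authors' implementation: it handles one block only, uses plain multiplicative updates, and omits the constraint matrices that the abstract says play an important role. The function name `tri_factorize`, the toy 6 x 8 peptidase-target block, and the ranks `k1 = k2 = 2` are all assumptions chosen for illustration.

```python
import numpy as np


def tri_factorize(R, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative block R (n x m) as G1 @ S @ G2.T.

    Hypothetical helper: unconstrained non-negative tri-factorization
    fitted with multiplicative updates on the squared Frobenius error.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    G1 = rng.random((n, k1))   # row-object factor (e.g. peptidases)
    G2 = rng.random((m, k2))   # column-object factor (e.g. targets)
    S = rng.random((k1, k2))   # compact "backbone" linking the two
    for _ in range(n_iter):
        # Each update rescales one factor while the other two are fixed;
        # eps keeps the denominators strictly positive.
        G1 *= (R @ G2 @ S.T) / (G1 @ (S @ G2.T @ G2 @ S.T) + eps)
        G2 *= (R.T @ G1 @ S) / (G2 @ (S.T @ G1.T @ G1 @ S) + eps)
        S *= (G1.T @ R @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
    return G1, S, G2


# Toy peptidase-target block (assumed data): 1 = known pair, 0 = negative/unknown.
R = np.zeros((6, 8))
R[:3, :4] = 1.0
R[3:, 4:] = 1.0
R[0, 1] = 0.0  # pretend this pair is unknown

# Ranks are arbitrary here; the abstract chooses them from the number
# of known interactions without giving a formula.
G1, S, G2 = tri_factorize(R, k1=2, k2=2)

# Reconstructed block, clipped to the [0, 1] range used for R.
R_hat = np.clip(G1 @ S @ G2.T, 0.0, 1.0)
```

In a setting like the one described, entries of `R_hat` that are 0 in `R` but receive a high reconstructed score would be the candidate novel peptidase-target pairs; how such scores are thresholded is not stated in the truncated abstract.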
Author Comment
This is an abstract which has been accepted for the BITS2016 Meeting.