Automated Protein Subfamily Identification and Classification

doi:10.1371/journal.pcbi.0030160


Figure 3

Novel Subtype Identification and Classification Accuracy as a Function of the Threshold on Subfamily Membership Probability

(A) The red line shows the fraction of novel subfamilies correctly detected; the blue line shows the fraction of subfamily members correctly classified in leave-one-out experiments. Novelty detection is quite robust to the threshold setting, achieving an 80% success rate even at the lowest threshold (0.01).

(B) The fraction of sequences classified to an incorrect subfamily during leave-one-out experiments. Although low to begin with, the false-positive rate drops dramatically once even a small threshold is imposed. A threshold of 0.10 on the probability of subfamily membership seems to be optimal: the false-positive classification rate is just 0.3%, while overall subfamily classification accuracy and novel subtype detection accuracy are both 88%. The x-axis shows the logistic regression probability threshold for subfamily membership assignment.
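
The thresholding described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `assign_subfamily` and `evaluate`, the dictionary input format, and the use of `>=` at the cutoff are all assumptions made for illustration, and the membership probabilities are taken as given rather than derived from a fitted logistic regression. A sequence is assigned to its best-scoring subfamily only if that probability reaches the threshold; otherwise it is flagged as a candidate novel subtype.

```python
from typing import Dict, List, Optional, Tuple


def assign_subfamily(probs: Dict[str, float], threshold: float = 0.10) -> Optional[str]:
    """Return the best-scoring subfamily, or None (candidate novel subtype).

    `probs` maps hypothetical subfamily names to membership probabilities.
    Treating a probability equal to the threshold as a match is an assumption.
    """
    if not probs:
        return None
    best = max(probs, key=lambda name: probs[name])
    return best if probs[best] >= threshold else None


def evaluate(
    cases: List[Tuple[Dict[str, float], Optional[str]]], threshold: float
) -> Dict[str, float]:
    """Compute the three rates plotted in the figure for one threshold.

    Each case is (probs, true_subfamily); a true label of None marks a
    sequence whose subfamily is absent from the models (a novel subtype).
    """
    known = [(p, t) for p, t in cases if t is not None]
    novel = [p for p, t in cases if t is None]
    calls = [(assign_subfamily(p, threshold), t) for p, t in known]
    correct = sum(1 for call, t in calls if call == t)
    # A misclassification is an assignment to some subfamily other than the true one.
    wrong = sum(1 for call, t in calls if call is not None and call != t)
    detected = sum(1 for p in novel if assign_subfamily(p, threshold) is None)
    return {
        "classified": correct / max(len(known), 1),
        "misclassified": wrong / max(len(known), 1),
        "novel_detected": detected / max(len(novel), 1),
    }


if __name__ == "__main__":
    # Made-up probabilities for illustration only; not data from the paper.
    toy_cases = [
        ({"A": 0.90, "B": 0.05}, "A"),
        ({"A": 0.20, "B": 0.05}, "B"),
        ({"A": 0.03, "B": 0.02}, None),
        ({"A": 0.50}, None),
    ]
    for cutoff in (0.01, 0.10, 0.30):
        print(cutoff, evaluate(toy_cases, cutoff))
```

Sweeping `threshold` over a range and plotting the three returned rates would give curves of the same kind as panels (A) and (B), with the trade-off that a higher cutoff rejects more misassignments but can also reject true members.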

doi: https://doi.org/10.1371/journal.pcbi.0030160.g003