Abstract
Text classification has been shown to be effective in helping human reviewers identify sensitive information when reviewing documents for release to the public. However, automatically classifying sensitive information is difficult, since sensitivity often depends on contextual knowledge that must be inferred from the text. For example, the mention of a specific named entity is unlikely to provide enough context to determine automatically whether the information is sensitive, whereas knowing the conceptual role of the entity, e.g. whether the entity is a politician or a terrorist, can provide useful additional context. Human sensitivity reviewers draw on their prior knowledge of such contextual information when making sensitivity judgements, but statistical or contextualized classifiers cannot easily resolve these cases from the text alone. In this paper, we propose a feature extraction method that models entities in a hierarchical structure, based on the underlying structure of Wikipedia, to generate a more informative representation of entities and their roles. Our experiments, on a test collection containing real-world sensitivities, show that our proposed approach results in a significant improvement in sensitivity classification performance (2.2% BAC, McNemar’s Test, p < 0.05) compared to a text-based sensitivity classifier.
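To illustrate the idea of representing an entity by its position in a category hierarchy, the following is a minimal sketch and not the authors' implementation. The toy `PARENTS` mapping, the entity names, the function names and the 1/depth weighting are all assumptions made for illustration; the paper derives its hierarchy from the structure of Wikipedia, which is not reproduced here.

```python
from collections import Counter

# Hypothetical toy hierarchy (child -> parent categories). Illustrative only;
# it stands in for a Wikipedia-derived hierarchy and is not the paper's data.
PARENTS = {
    "Jane Doe": ["Politician"],
    "John Roe": ["Terrorist"],
    "Politician": ["Person"],
    "Terrorist": ["Criminal"],
    "Criminal": ["Person"],
    "Person": [],
}


def ancestors(entity, max_depth=3):
    """Map each category reachable from `entity` to the depth at which it is first seen."""
    seen = {}
    frontier = [entity]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for node in frontier:
            for parent in PARENTS.get(node, []):
                if parent not in seen:
                    seen[parent] = depth
                    next_frontier.append(parent)
        frontier = next_frontier
    return seen


def hierarchy_features(entities, max_depth=3):
    """Build bag-of-categories features for the entities linked in one document.

    Assumption: closer (more specific) categories get more weight, using 1/depth.
    """
    features = Counter()
    for entity in entities:
        for category, depth in ancestors(entity, max_depth).items():
            features[f"cat={category}"] += 1.0 / depth
    return dict(features)


if __name__ == "__main__":
    print(hierarchy_features(["Jane Doe", "John Roe"]))
```

Under these assumptions, two documents that mention different people would still share features such as `cat=Person`, while features such as `cat=Politician` or `cat=Terrorist` separate the entities by role. Features of this kind could be appended to the text features of a standard classifier.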
Acknowledgements
E. Frayling, C. Macdonald and I. Ounis acknowledge the support of Innovate UK through a Knowledge Transfer Partnership (# 12040). All authors thank SVGC Ltd. for their support.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Frayling, E., Macdonald, C., McDonald, G., Ounis, I. (2022). Using Entities in Knowledge Graph Hierarchies to Classify Sensitive Information. In: Barrón-Cedeño, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2022. Lecture Notes in Computer Science, vol 13390. Springer, Cham. https://doi.org/10.1007/978-3-031-13643-6_10
DOI: https://doi.org/10.1007/978-3-031-13643-6_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13642-9
Online ISBN: 978-3-031-13643-6
eBook Packages: Computer Science, Computer Science (R0)