Abstract
An increasing number of people share information through text messages, emails, and social media without proper privacy checks. In many situations, this can lead to serious privacy threats. This paper presents a methodology for providing extra safety precautions without being intrusive to users. We have developed and evaluated a model that helps users take control of the information they share by automatically identifying text (i.e., a sentence or a transcribed utterance) that might contain personal or private disclosures. We apply off-the-shelf natural language processing tools to derive linguistic features such as part-of-speech tags, syntactic dependencies, and entity relations. From these features, we build and train a multichannel convolutional neural network as a classifier to identify short texts that contain personal or private disclosures. We show how our model can notify users if a piece of text discloses personal or private information, and we evaluate our approach on a binary classification task, where it achieves 93% accuracy on our own labeled dataset and 86% on a ground-truth dataset. Unlike document-level classification tasks in natural language processing, our framework is developed with sentence-level context in mind.
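The pipeline summarized above can be illustrated with a small sketch. It is not the authors' implementation: it is a plain-Python forward pass of a multichannel 1D convolution with max-over-time pooling, assuming one channel each for part-of-speech tags, dependency labels, and entity labels. The example sentence is tagged by hand (an off-the-shelf parser would normally supply the tags), all names such as `build_model` and `predict` are hypothetical, and the weights are random and untrained, so the resulting score carries no meaning.

```python
import math
import random

# Hypothetical hand-tagged sentence: (word, POS tag, dependency label, entity label).
SENTENCE = [
    ("I", "PRON", "nsubj", "O"),
    ("live", "VERB", "ROOT", "O"),
    ("in", "ADP", "prep", "O"),
    ("Boise", "PROPN", "pobj", "GPE"),
]
CHANNEL_FIELDS = (1, 2, 3)  # one channel each for POS, dependency, entity


def build_vocab(symbols):
    """Map each distinct symbol to an integer id; id 0 is reserved for padding."""
    return {sym: i + 1 for i, sym in enumerate(sorted(set(symbols)))}


def encode_channel(symbols, vocab, max_len):
    """Turn symbols into ids, truncated or zero-padded to max_len."""
    ids = [vocab.get(sym, 0) for sym in symbols][:max_len]
    return ids + [0] * (max_len - len(ids))


def conv1d_max_pool(seq, filters, biases):
    """Apply each filter along the sequence, then ReLU and max-over-time pooling."""
    pooled = []
    for filt, bias in zip(filters, biases):
        width = len(filt)
        best = 0.0  # starting at 0 folds the ReLU into the max
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            act = bias + sum(
                w * x
                for w_row, x_row in zip(filt, window)
                for w, x in zip(w_row, x_row)
            )
            best = max(best, act)
        pooled.append(best)
    return pooled


def build_model(sentences, dim=4, n_filters=3, width=2, seed=0):
    """Create untrained, randomly initialised parameters for every channel."""
    rng = random.Random(seed)
    channels = []
    for field in CHANNEL_FIELDS:
        vocab = build_vocab(tok[field] for sent in sentences for tok in sent)
        emb = [[0.0] * dim] + [
            [rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab
        ]
        filters = [
            [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(width)]
            for _ in range(n_filters)
        ]
        channels.append({"field": field, "vocab": vocab, "emb": emb,
                         "filters": filters, "biases": [0.0] * n_filters})
    dense_w = [rng.uniform(-0.5, 0.5) for _ in range(n_filters * len(channels))]
    return channels, dense_w, 0.0


def predict(sentence, channels, dense_w, dense_b, max_len=6):
    """Forward pass: per-channel convolution, concatenation, sigmoid output."""
    features = []
    for ch in channels:
        symbols = [tok[ch["field"]] for tok in sentence]
        ids = encode_channel(symbols, ch["vocab"], max_len)
        seq = [ch["emb"][i] for i in ids]
        features.extend(conv1d_max_pool(seq, ch["filters"], ch["biases"]))
    z = dense_b + sum(w * f for w, f in zip(dense_w, features))
    return 1.0 / (1.0 + math.exp(-z))


channels, dense_w, dense_b = build_model([SENTENCE])
score = predict(SENTENCE, channels, dense_w, dense_b)
print(f"untrained disclosure score: {score:.3f}")
```

Keeping a separate embedding table and filter bank per channel lets each linguistic feature be convolved independently before the pooled outputs are concatenated for the final sigmoid unit. A real system would learn these parameters from labeled sentences, typically with a deep learning library rather than hand-written loops.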
Notes
- 1. It is worth mentioning that the accuracy fluctuates only slightly when the number of neurons in these layers is changed. This is unsurprising: the layer would probably need more neurons to model non-linearity better only if it saw relatively more data.
- 2.
Acknowledgments
The authors would like to thank the National Science Foundation for its support through the Computer and Information Science and Engineering (CISE) program and Research Initiation Initiative (CRII) grant number 1657774 of the Secure and Trustworthy Cyberspace (SaTC) program: A System for Privacy Management in Ubiquitous Environments.
Copyright information
© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Mehdy, N., Kennington, C., Mehrpouyan, H. (2019). Privacy Disclosures Detection in Natural-Language Text Through Linguistically-Motivated Artificial Neural Networks. In: Li, J., Liu, Z., Peng, H. (eds) Security and Privacy in New Computing Environments. SPNCE 2019. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 284. Springer, Cham. https://doi.org/10.1007/978-3-030-21373-2_14
DOI: https://doi.org/10.1007/978-3-030-21373-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-21372-5
Online ISBN: 978-3-030-21373-2
eBook Packages: Computer Science, Computer Science (R0)