PRIMULE: Privacy risk mitigation for user profiles
Introduction
Nowadays, mobile devices record digital traces of various human activities, such as movements, purchase transactions, preferences, and opinions. Thus, they are an important source of information enabling studies in environmental monitoring, transportation, social networks, innovative demographic indexes, and human behavior. In particular, the availability of Call Detail Record (CDR) data produced by mobile phones has stimulated research on sophisticated data mining algorithms for understanding people's habits and mobility patterns [1]. This type of data has also been used for monitoring population movements and displacement after disasters, such as earthquakes [2], and for supporting decision making in public health, particularly when considering the dynamics and spread of infectious diseases and the consequences of a natural disaster [3].
The opportunity of exploiting big data has also attracted the interest of official statistics [4]. Indeed, a hot topic in official statistics is currently the exploitation of big data in combination with traditional data sources, in order to improve the quality, timeliness, and spatio-temporal granularity of statistical information. As an example, in [5], Furletti et al. presented the Sociometer, a data mining tool that classifies users by means of their calling habits and uses the calling activities to infer a presence indicator for different categories of people in a city. It takes advantage of a methodology able to construct an aggregate and compact user call profile.
The use of human data both for understanding social phenomena and for developing data-driven services is becoming common, but at the same time it raises concerns about the leakage of personal information and re-identification. In fact, numerous services have been temporarily suspended or even shut down because of such issues. In practice, knowledge discovery related to human behavior nowadays comes with unprecedented opportunities and risks. The paradoxical situation we face is that we are running the risks without fully seizing the opportunities of big data. On the one hand, we feel that our private space is vanishing in the digital world, and that our personal data can be used without feedback or control; on the other hand, the same data sit in the databases of companies (telecom operators, insurance companies, and so on), which use legal constraints on privacy as a reason for not sharing them with science and society at large, keeping this precious source of knowledge locked away from data analysts and service developers.
In Europe, policy-makers have addressed this problem with the General Data Protection Regulation (GDPR) [6]. This regulation responds to privacy and data protection threats associated with new data practices by strengthening protections for individuals, and by harmonizing the legal framework so that data can flow better within Europe. The GDPR introduces the practice of a Data Protection Impact Assessment and the application of the Privacy-by-Design principle in the creation of information systems. Thus, it is necessary to keep the privacy risk of the users in the data under control, and to enable knowledge discovery from raw data while preventing privacy violations by design.
In this paper, we address the problem of guaranteeing privacy protection while using individual profiles for the extraction of additional knowledge, hidden in the data, through sophisticated data mining processes. In particular, our main goal is to guarantee privacy protection during the application of the Sociometer [5], which is considered a valuable tool for official statistics [7]. To this end, we propose PRIMULE (Privacy RIsk Mitigation for User profiLEs), a privacy risk mitigation strategy for making a set of user profiles private. PRIMULE relies on PRUDEnce [8], a privacy risk assessment framework that provides a methodology for systematically identifying risky users in a set of data. On the basis of this privacy risk assessment, PRIMULE makes similar profiles indistinguishable in order to eliminate risky profiles.
We conduct a detailed analysis of our approach using a real dataset. In particular, we used a CDR dataset that covers 139 municipalities of Tuscany, with 85 million CDRs from about 3 million customers in the month of November 2016 (4 weeks). The extensive experimentation shows the effectiveness of PRIMULE: after the privacy risk mitigation, the quality of the profiles remains high in terms of similarity with respect to the original ones. This is also confirmed by the utility of the private profiles for the Sociometer, measured in terms of both classification and quantification performance. Empirical results demonstrate good classification and quantification, especially for the resident category of city users. In each experiment, we compare PRIMULE against a method based on differential privacy [9]; again, experiments show that our proposal provides much better results in terms of data quality and service utility.
The rest of this paper is organized as follows. In Section 2 we review relevant literature about privacy in mobile phone data. In Section 3, we describe the basis of our work: (i) the individual user profile describing the calling activity, (ii) the Sociometer framework, and (iii) the PRUDEnce framework used for the assessment and the mitigation of privacy risk. In Section 4, we introduce the problem definition, while in Sections 5 and 6 we present the privacy attack model and our mitigation strategy PRIMULE. In Section 7, we show the results of our experiments on real data, providing evidence of the effectiveness of our approach in terms of both individual privacy and accuracy of the results. Finally, Section 8 concludes the paper.
Section snippets
Related work
Relatively little work has addressed privacy issues in the publication and analysis of GSM data. In the literature, many works dealing with mobile phone data claim that there is no privacy issue, or at least that privacy problems are mitigated by the large spatial coverage of phone cells. However, Golle and Partridge [10] showed that a fraction of the US working population can be uniquely identified by their home and work locations even when those locations are not known at a fine scale or
Background
The GSM (Global System for Mobile Communications) Network is a mobile network that enables communication between mobile devices. The GSM protocol is based on a cellular network architecture, where a geographical area is covered by a number of antennas emitting a signal to be received by mobile devices. Each antenna covers an area called a cell. In this way, the covered area is partitioned into a number of, possibly overlapping, cells, uniquely identified by the antenna. Cell horizontal radius
Problem definition and proposed solution
As discussed in Section 2, mobile phone data are subject to privacy issues. Our aim is to enable the sharing of this kind of data while achieving two important but conflicting goals: on the one hand, we want adequate privacy guarantees to be provided, in order to limit the privacy risk of the individuals described in the data; on the other hand, shared data should not be too distorted, in order to ensure that specific analyses, such as the Sociometer (Section 3.2), are still possible,
Privacy risk assessment
The quantification of the probability of re-identification of each individual in the data requires simulating a privacy attack. Our attack model is based on the linking attack [38], and it uses a specific and strong background knowledge. Indeed, in our setting, the attack assumes that the adversary has perfect knowledge of the call activities of his/her target in the observed area. In other words, for a specific time window and geographical area, the idea is to quantify the probability of
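Under this worst-case linking attack, the re-identification probability of a user reduces to one over the number of users sharing an identical profile. A minimal sketch of this assessment is shown below; the tuple representation of a call profile and the function name are illustrative assumptions, not the paper's actual data structures.

```python
from collections import Counter

def reidentification_risk(profiles):
    """For each user, estimate the re-identification probability as
    1 / (number of users with an identical profile), assuming an
    adversary who knows the target's full call profile."""
    group_sizes = Counter(profiles.values())  # profile -> how many users share it
    return {user: 1.0 / group_sizes[p] for user, p in profiles.items()}

# Toy example: profiles as tuples of call counts per time slot
profiles = {
    "u1": (3, 0, 2),
    "u2": (3, 0, 2),  # identical to u1: both share the profile, risk 1/2
    "u3": (1, 5, 0),  # unique profile: risk 1.0 (certain re-identification)
}
risk = reidentification_risk(profiles)
```

Profiles with risk above a chosen threshold (e.g. risk equal to 1.0, a unique profile) are the "risky" cases that the mitigation step then targets.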
PRIMULE: Privacy RIsk Mitigation for User profiLEs
We devise a method that is based on the knowledge of the Sociometer process and has the goal of obtaining a set of safe ICPs. A profile is considered safe if it is indistinguishable from at least k − 1 others. In other words, considering our attack model, a profile is safe if its probability of re-identification is at most 1/k, where k is a parameter of PRIMULE (Algorithm 1).
Thus, our basic idea is to create groups of indistinguishable profiles by rendering equal those profiles which are already
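The grouping idea above can be sketched as follows. This is a simplified greedy stand-in for PRIMULE's actual algorithm (which the snippet does not fully specify): profiles are sorted, partitioned into groups of at least k, and each member is replaced by its group's rounded centroid, so that every published profile is shared by at least k users.

```python
from collections import Counter

def mitigate(profiles, k):
    """Greedy sketch of k-indistinguishability: partition profiles into
    groups of size >= k and publish each group's rounded centroid."""
    users = sorted(profiles, key=lambda u: profiles[u])  # put similar profiles near each other
    groups = [users[i:i + k] for i in range(0, len(users), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())  # fold a short tail into the previous group
    safe = {}
    for group in groups:
        dims = len(profiles[group[0]])
        centroid = tuple(
            round(sum(profiles[u][d] for u in group) / len(group))
            for d in range(dims)
        )
        for u in group:
            safe[u] = centroid  # every member becomes indistinguishable
    return safe

profiles = {
    "u1": (3, 0, 2), "u2": (3, 0, 3), "u3": (1, 5, 0),
    "u4": (1, 4, 0), "u5": (0, 5, 1),
}
safe = mitigate(profiles, k=2)
group_sizes = Counter(safe.values())  # each published profile's group size
```

Sorting by raw tuple order is a crude proxy for profile similarity; a faithful implementation would use a distance suited to the ICP structure, but the safety property (every output profile shared by at least k users) is the same.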
Experiments
In this section, we evaluate our approach, showing the outcome of applying the Sociometer to three different sets of data. We present the Sociometer applied to ICPs treated with PRIMULE, our ad hoc mitigation strategy (Section 6.2), comparing its outcomes with the Sociometer applied to profiles without any kind of sanitization (i.e., our baseline, since these ICPs are the original ones) and to profiles perturbed by an approach based on the
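The introduction identifies the perturbation baseline as a method based on differential privacy [9]. A minimal sketch of such a baseline is below, assuming a standard Laplace mechanism (the paper's exact mechanism and parameters are not given in this excerpt): each cell of a call profile receives Laplace noise with scale sensitivity/ε, and negative counts are clipped to zero.

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale) using only the stdlib
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_perturb(profile, epsilon, sensitivity=1.0, seed=0):
    """Perturb each cell of a call profile with Laplace noise of scale
    sensitivity/epsilon, then clip negative counts at zero."""
    rng = random.Random(seed)  # seeded here only for reproducibility of the sketch
    scale = sensitivity / epsilon
    return tuple(max(0.0, v + laplace_noise(scale, rng)) for v in profile)

noisy = dp_perturb((3, 0, 2), epsilon=1.0)
```

Because the noise is independent of the data, such a baseline distorts every profile, including already-safe ones, which is consistent with the paper's finding that PRIMULE preserves data quality better.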
Conclusion
In this paper, we have studied the problem of guaranteeing privacy protection for individual user profiles, which describe the call activity registered by mobile phones. Our mitigation strategy, called PRIMULE, provides privacy protection by making similar profiles indistinguishable to eliminate possible risky cases. The proposed approach relies on the privacy risk assessment framework PRUDEnce [8] for the identification of risky profiles. PRIMULE is particularly tailored to individual call
Acknowledgment
This work has been supported by the EU H2020 SoBigData Infrastructure project, under grant agreement No. 654024.
References (47)
- MP4-A Project: Mobility planning for Africa
- Improved response to disasters and outbreaks by tracking population movements with mobile phone network data: A post-earthquake geospatial study in Haiti, PLoS Med. (2011)
- Mobile network data for public health: Opportunities and challenges, Front. Public Health (2015)
- Challenges and opportunities with mobile phone data in official statistics
- Analysis of GSM calls data for understanding user mobility behavior
- General Data Protection Regulation
- Use of mobile phone data to estimate mobility flows: Measuring urban population and inter-city mobility using big data in an integrated approach
- PRUDEnce: A system for assessing privacy risk vs utility in data sharing ecosystems, Trans. Data Priv. (2018)
- Differential privacy
- On the anonymity of home/work location pairs
- Anonymization of location data does not work: A large-scale measurement study
- Sequenced release of privacy accurate call data record information in a GSM forensic investigation
- Identification via location-profiling in GSM networks
- Mapping the privacy-utility tradeoff in mobile phone data for development
- Unique in the crowd: The privacy bounds of human mobility, Sci. Rep.
- On the privacy-conscientious use of mobile phone data, Sci. Data
- Privacy increase on telecommunication processes
- Revisiting online anonymization algorithms to ensure location privacy, J. Ambient Intell. Humaniz. Comput.
- A survey of results on mobile phone datasets analysis, EPJ Data Sci.
- Human mobility modeling at metropolitan scales
- Privacy by Design: The 7 Foundational Principles
- Privacy-by-design in big data analytics and social mining, EPJ Data Sci.
Francesca Pratesi received the MS degree and the Ph.D. in computer science from Pisa University, Italy, respectively in 2013 and in 2017. She is a Post Doc at the Computer Science Department of University of Pisa since March 2018 and a member of the Knowledge Discovery and Data Mining Lab, a joint research group with the Information Science and Technology Institute of the National Research Council in Pisa. Her research interests include data mining, data privacy and privacy risk assessment, mainly in spatio-temporal data.
Dr. Lorenzo Gabrielli is a Data Scientist at the National Research Council, developing innovative tools in the domain of migration flows and population trends. Over recent years, he has gained experience in the analysis of Big Data with Data Mining and Machine Learning techniques in national and international contexts, collaborating with several public and private research institutes. His interests concern mobility data mining with heterogeneous data, such as detecting urban mobility patterns and anomalies, studying individual and collective mobility behavior, semantic enrichment of movements, and studying the capability of public transport to attract private vehicular trips.
Paolo Cintia was born in Marsciano (PG) on 4 November 1983. He received his Ph.D. in Computer Science in 2015 and is a post-doc researcher at the University of Pisa. His research activity is focused on Sports Data Science and Mobility Data Mining. He is co-founder of Playerank srl, a spin-off company of the University of Pisa developing AI tools for Sports Analytics.
Anna Monreale is an assistant professor at the Computer Science Department of the University of Pisa and a member of the Knowledge Discovery and Data Mining Laboratory (KDD-Lab), a joint research group with the Information Science and Technology Institute of the National Research Council in Pisa. She was a visiting student at the Department of Computer Science of the Stevens Institute of Technology (Hoboken, New Jersey, USA) in 2010. Her research interests include big data analytics, social networks, and the privacy issues arising in mining these kinds of social and human sensitive data. In particular, she is interested in the evaluation of privacy risks during analytical processes and in the design of privacy-by-design technologies in the era of big data. She earned her Ph.D. in computer science from the University of Pisa in June 2011, with a dissertation on privacy-by-design in data mining.
Fosca Giannotti is a director of research in computer science at the Information Science and Technology Institute “A. Faedo” of the National Research Council, Pisa, Italy. She is a pioneering scientist in mobility data mining, social network analysis, and privacy-preserving data mining. Fosca leads the Pisa KDD Lab (Knowledge Discovery and Data Mining Laboratory), a joint research initiative of the University of Pisa and ISTI-CNR, founded in 1994 as one of the earliest research labs on data mining. Her research focus is on social mining from big data: smart cities, human dynamics, social and economic networks, ethics and trust, and the diffusion of innovations. She is the author of more than 300 papers and has coordinated dozens of European projects and industrial collaborations. Fosca is currently the coordinator of SoBigData, the European research infrastructure on Big Data Analytics and Social Mining, an ecosystem of ten cutting-edge European research centres providing an open platform for interdisciplinary data science and data-driven innovation. Recently she became the recipient of a prestigious ERC Advanced Grant entitled XAI – Science and technology for the explanation of AI decision making.