Abstract
Privacy is a fundamental right that Information Retrieval (IR) models can threaten when they are trained on, or applied to, sensitive data and personal user information. Although mechanisms have been proposed to protect user privacy, their effectiveness is typically assessed by studying the relation between performance and the parameters of the mechanism, such as the privacy budget in Differential Privacy (DP). This often creates a disconnect between formal privacy and the privacy actually experienced by the user, the actual privacy. In this paper, we present the Query Inference for Privacy and Utility (QuIPU) framework, a novel evaluation paradigm that assesses actual privacy based on the risk that an “honest-but-curious” IR system can infer the original query from the obfuscated queries it receives. QuIPU represents the first attempt at measuring actual privacy for IR tasks beyond the comparison of formal privacy parameters. Our analysis shows that formal privacy parameters do not imply actual privacy: for the same privacy parameter values, two systems can provide not only different utility but also different actual privacy. A proper way of assessing this risk is therefore needed, and QuIPU provides one.
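To make the query-inference idea concrete, the following is a minimal sketch, not the QuIPU implementation. It assumes a toy bag-of-words representation in place of a real text encoder, and the names `embed`, `infer_query`, `inference_rate`, and the example queries are all hypothetical. The curious system guesses, for each obfuscated query, the most similar query in a candidate log; the fraction of correct guesses serves as a rough indicator of how little actual privacy the obfuscation provides.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (an assumed stand-in for a text encoder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def infer_query(obfuscated, candidate_log):
    """Guess the logged query most similar to the obfuscated one."""
    target = embed(obfuscated)
    return max(candidate_log, key=lambda q: cosine(target, embed(q)))


def inference_rate(pairs, candidate_log):
    """Fraction of (original, obfuscated) pairs whose original is recovered."""
    hits = sum(infer_query(obf, candidate_log) == orig for orig, obf in pairs)
    return hits / len(pairs)


# Hypothetical data: a candidate log and two obfuscated queries.
candidate_log = [
    "symptoms of diabetes",
    "cheap flights to rome",
    "python list sort",
]
pairs = [
    ("symptoms of diabetes", "signs of diabetes"),      # weak obfuscation
    ("cheap flights to rome", "budget airfare italy"),  # no shared terms
]
rate = inference_rate(pairs, candidate_log)
print(f"inference rate: {rate:.2f}")
```

In this toy setting the first obfuscation leaks enough vocabulary to be reversed while the second does not, which illustrates how two obfuscations can differ in actual privacy independently of any formal parameter.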
Notes
- 1. Remark on the notation: with \(\mathcal {T}\left( \mathcal {Q}_{\text {obf}}\right) ,\mathcal {T}\left( \mathcal {Q}_{\text {logs}}\right) \) we indicate the sets of text embeddings, and with \(\mathcal {T}(q'_i),\mathcal {T}(q_i)\) the individual vector embeddings of the queries.
- 2.
- 3.
- 4. The DP configurations with \(\varepsilon >1\) deviate from the “theoretically safe” privacy setup, i.e., one that gives strong assurance about the formal privacy introduced; see the DP definition [21].
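For reference, the standard \(\varepsilon \)-DP guarantee can be written as follows. This is the textbook formulation, stated here as an assumption about what the note refers to; the chapter itself may rely on a variant such as metric DP. A randomized mechanism \(\mathcal {M}\) satisfies \(\varepsilon \)-DP if, for all neighbouring inputs \(D, D'\) and every set of outputs \(S\),

\[ \Pr \left[ \mathcal {M}(D) \in S\right] \le e^{\varepsilon } \, \Pr \left[ \mathcal {M}(D') \in S\right] . \]

At \(\varepsilon = 1\) the two output probabilities may differ by a factor of at most \(e \approx 2.72\); the bound loosens exponentially as \(\varepsilon \) grows, which is why configurations with \(\varepsilon > 1\) give progressively weaker formal assurance.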
References
Ahmad, W.U., Chang, K., Wang, H.: Intent-aware query obfuscation for privacy protection in personalized web search. In: Collins-Thompson, K., Mei, Q., Davison, B.D., Liu, Y., Yilmaz, E. (eds.) The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, pp. 285–294. ACM (2018). https://doi.org/10.1145/3209978.3209983
Anderson, C.: The long tail. Mann, Ivanov & Ferber, Effective Business Model on the Internet-Moscow (2012)
Arampatzis, A., Drosatos, G., Efraimidis, P.: A versatile tool for privacy-enhanced web search. In: Serdyukov, P., et al. (eds.) Advances in Information Retrieval - 35th European Conference on IR Research, ECIR 2013, Moscow, Russia, March 24-27, 2013. Proceedings. Lecture Notes in Computer Science, vol. 7814, pp. 368–379. Springer (2013). https://doi.org/10.1007/978-3-642-36973-5_31
Bavadekar, S., et al.: Google COVID-19 search trends symptoms dataset: Anonymization process description (version 1.0). CoRR abs/2009.01265 (2020). https://arxiv.org/abs/2009.01265
Blanco-Justicia, A., Sánchez, D., Domingo-Ferrer, J., Muralidhar, K.: A critical review on the use (and misuse) of differential privacy in machine learning. ACM Comput. Surv. 55(8), 160:1–160:16 (2023). https://doi.org/10.1145/3547139
Bo, H., Ding, S.H.H., Fung, B.C.M., Iqbal, F.: ER-AE: differentially private text generation for authorship anonymization. In: Toutanova, K., et al. (eds.) Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pp. 3997–4007. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.naacl-main.314
Carvalho, R.S., Vasiloudis, T., Feyisetan, O., Wang, K.: TEM: high utility metric differential privacy on text. In: Shekhar, S., Zhou, Z., Chiang, Y., Stiglic, G. (eds.) Proceedings of the 2023 SIAM International Conference on Data Mining, SDM 2023, Minneapolis-St. Paul Twin Cities, MN, USA, April 27-29, 2023, pp. 883–890. SIAM (2023). https://doi.org/10.1137/1.9781611977653.CH99
Chatzikokolakis, K., Andrés, M.E., Bordenabe, N.E., Palamidessi, C.: Broadening the scope of differential privacy using metrics. In: Cristofaro, E.D., Wright, M.K. (eds.) Privacy Enhancing Technologies - 13th International Symposium, PETS 2013, Bloomington, IN, USA, July 10-12, 2013. Proceedings. Lecture Notes in Computer Science, vol. 7981, pp. 82–102. Springer (2013). https://doi.org/10.1007/978-3-642-39077-7_5
Chau, M., Fang, X., Sheng, O.R.L.: Analysis of the query logs of a web site search engine. J. Assoc. Inf. Sci. Technol. 56(13), 1363–1376 (2005). https://doi.org/10.1002/ASI.20210
Chen, S., et al.: A customized text sanitization mechanism with differential privacy. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Findings of the Association for Computational Linguistics: ACL 2023, pp. 5747–5758. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.findings-acl.355, https://aclanthology.org/2023.findings-acl.355
Clauß, S., Schiffner, S.: Structuring anonymity metrics. In: Juels, A., Winslett, M., Goto, A. (eds.) Proceedings of the 2006 Workshop on Digital Identity Management, Alexandria, VA, USA, November 3, 2006, pp. 55–62. ACM (2006). https://doi.org/10.1145/1179529.1179539
Clifton, C., Tassa, T.: On syntactic anonymity and differential privacy. In: Chan, C.Y., Lu, J., Nørvåg, K., Tanin, E. (eds.) Workshops Proceedings of the 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pp. 88–93. IEEE Computer Society (2013). https://doi.org/10.1109/ICDEW.2013.6547433
Craswell, N., Mitra, B., Yilmaz, E., Campos, D.: Overview of the TREC 2020 deep learning track. CoRR abs/2102.07662 (2021). https://arxiv.org/abs/2102.07662
Craswell, N., Mitra, B., Yilmaz, E., Campos, D., Voorhees, E.M.: Overview of the TREC 2019 deep learning track. CoRR abs/2003.07820 (2020). https://arxiv.org/abs/2003.07820
Damie, M., Hahn, F., Peter, A.: A highly accurate query-recovery attack against searchable encryption using non-indexed documents. In: Bailey, M.D., Greenstadt, R. (eds.) 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pp. 143–160. USENIX Association (2021). https://www.usenix.org/conference/usenixsecurity21/presentation/damie
De Faveri, F.L., Faggioli, G., Ferro, N.: py-PANTERA: a Python PAckage for Natural language obfuscaTion Enforcing pRivacy & Anonymization. In: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM ’24), October 21-25, 2024, Boise, ID, USA. p. 6. Springer (2024). https://doi.org/10.1145/3627673.3679173
De Faveri, F.L., Faggioli, G., Ferro, N.: Words Blending Boxes. Obfuscating Queries in Information Retrieval using Differential Privacy. CoRR abs/2405.09306 (2024). https://doi.org/10.48550/ARXIV.2405.09306
Domingo-Ferrer, J., Sánchez, D., Blanco-Justicia, A.: The limits of differential privacy (and its misuse in data release and machine learning). Commun. ACM 64(7), 33–35 (2021). https://doi.org/10.1145/3433638
Duncan, G., Keller-McNulty, S., Stokes, L.: Disclosure risk vs. data utility: the RU confidentiality map. A Los Alamos National Laboratory Technical Report LA-UR-01-6428, 1–30 (2001)
Dwork, C., Kohli, N., Mulligan, D.K.: Differential privacy in practice: expose your epsilons! J. Priv. Confidentiality 9(2) (2019). https://doi.org/10.29012/JPC.689
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) Theory of Cryptography, pp. 265–284. Springer, Berlin, Heidelberg (2006)
Faggioli, G., Ferro, N.: Query obfuscation for information retrieval through differential privacy. In: Goharian, N., Tonellotto, N., He, Y., Lipani, A., McDonald, G., Macdonald, C., Ounis, I. (eds.) Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part I. Lecture Notes in Computer Science, vol. 14608, pp. 278–294. Springer (2024). https://doi.org/10.1007/978-3-031-56027-9_17
De Faveri, F.L., Faggioli, G., Ferro, N.: Beyond the parameters: Measuring actual privacy in obfuscated texts. In: Roitero, K., Viviani, M., Maddalena, E., Mizzaro, S. (eds.) Proceedings of the 14th Italian Information Retrieval Workshop, Udine, Italy, September 5-6, 2024. CEUR Workshop Proceedings, vol. 3802, pp. 53–57. CEUR-WS.org (2024). https://ceur-ws.org/Vol-3802/paper5.pdf
Feyisetan, O., Balle, B., Drake, T., Diethe, T.: Privacy- and utility-preserving textual analysis via calibrated multivariate perturbations. In: Caverlee, J., Hu, X.B., Lalmas, M., Wang, W. (eds.) Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 178–186. ACM (2020). https://doi.org/10.1145/3336191.3371856
Feyisetan, O., Kasiviswanathan, S.: Private release of text embedding vectors. In: Pruksachatkun, Y., et al. (eds.) Proceedings of the First Workshop on Trustworthy Natural Language Processing, pp. 15–27. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.trustnlp-1.3, https://aclanthology.org/2021.trustnlp-1.3
Fröbe, M., Schmidt, E.O., Hagen, M.: Efficient query obfuscation with keyqueries. In: He, J., et al. (eds.) WI-IAT ’21: IEEE/WIC/ACM International Conference on Web Intelligence, Melbourne VIC Australia, December 14–17, 2021, pp. 154–161. ACM (2021). https://doi.org/10.1145/3486622.3493950
Habernal, I.: When differential privacy meets NLP: the devil is in the detail. In: Moens, M., Huang, X., Specia, L., Yih, S.W. (eds.) Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 1522–1528. Association for Computational Linguistics (2021). https://doi.org/10.18653/V1/2021.EMNLP-MAIN.114
Hsu, J., et al.: Differential privacy: an economic method for choosing epsilon. In: IEEE 27th Computer Security Foundations Symposium, CSF 2014, Vienna, Austria, 19-22 July, 2014, pp. 398–410. IEEE Computer Society (2014). https://doi.org/10.1109/CSF.2014.35
Izacard, G., et al.: Unsupervised dense information retrieval with contrastive learning. Trans. Mach. Learn. Res. 2022 (2022). https://openreview.net/forum?id=jKN1pXi7b0
Jansen, B.J., Spink, A., Saracevic, T.: Real life, real users, and real needs: a study and analysis of user queries on the web. Inf. Process. Manag. 36(2), 207–227 (2000). https://doi.org/10.1016/S0306-4573(99)00056-4
Kang, Y., Liu, Y., Niu, B., Tong, X., Zhang, L., Wang, W.: Input perturbation: a new paradigm between central and local differential privacy. CoRR abs/2002.08570 (2020). https://arxiv.org/abs/2002.08570
Klymenko, O., Meisenbacher, S., Matthes, F.: Differential privacy in natural language processing: the story so far. In: Feyisetan, O., Ghanavati, S., Thaine, P., Habernal, I., Mireshghallah, F. (eds.) Proceedings of the Fourth Workshop on Privacy in Natural Language Processing, pp. 1–11. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.privatenlp-1.1, https://aclanthology.org/2022.privatenlp-1.1
Kohli, N., Laskowski, P.: Epsilon voting: mechanism design for parameter selection in differential privacy. In: 2018 IEEE Symposium on Privacy-Aware Computing, PAC 2018, Washington, DC, USA, September 26-28, 2018, pp. 19–30. IEEE (2018). https://doi.org/10.1109/PAC.2018.00009
Lee, J., Clifton, C.: How much is enough? Choosing \(\epsilon \) for differential privacy. In: Lai, X., Zhou, J., Li, H. (eds.) Information Security, 14th International Conference, ISC 2011, Xi’an, China, October 26-29, 2011. Proceedings. Lecture Notes in Computer Science, vol. 7001, pp. 325–340. Springer (2011). https://doi.org/10.1007/978-3-642-24861-0_22
Mattern, J., Weggenmann, B., Kerschbaum, F.: The limits of word level differential privacy. In: Carpuat, M., de Marneffe, M., Ruíz, I.V.M. (eds.) Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pp. 867–881. Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2022.findings-naacl.65
Meisenbacher, S.J., Nandakumar, N., Klymenko, A., Matthes, F.: A comparative analysis of word-level metric differential privacy: benchmarking the privacy-utility trade-off. In: Calzolari, N., Kan, M., Hoste, V., Lenci, A., Sakti, S., Xue, N. (eds.) Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, pp. 174–185. ELRA and ICCL (2024). https://aclanthology.org/2024.lrec-main.16
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995). https://doi.org/10.1145/219717.219748
Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. 27(1), 2:1–2:27 (2008). https://doi.org/10.1145/1416950.1416952
Moore, T., Clayton, R.: Evil searching: compromise and recompromise of internet hosts for phishing. In: Dingledine, R., Golle, P. (eds.) Financial Cryptography and Data Security, 13th International Conference, FC 2009, Accra Beach, Barbados, February 23-26, 2009. Revised Selected Papers. Lecture Notes in Computer Science, vol. 5628, pp. 256–272. Springer (2009). https://doi.org/10.1007/978-3-642-03549-4_16
National Institute of Standards and Technology: Information security. Tech. Rep. National Institute of Standards and Technology Special Publication 800-60, Volume 1 Revision 1, August, 2008, U.S. Department of Commerce, Washington, D.C. (2008). https://doi.org/10.6028/NIST.SP.800-60v1r1
Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., Deng, L.: MS MARCO: a human generated machine reading comprehension dataset. In: Besold, T.R., Bordes, A., d’Avila Garcez, A.S., Wayne, G. (eds.) Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016. CEUR Workshop Proceedings, vol. 1773. CEUR-WS.org (2016). https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf
Rao, R.S., Pais, A.R.: Jail-Phish: an improved search engine based phishing detection system. Comput. Secur. 83, 246–267 (2019). https://doi.org/10.1016/J.COSE.2019.02.011
Rényi, A.: On measures of entropy and information. In: Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics, vol. 4, pp. 547–562. University of California Press (1961)
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019). http://arxiv.org/abs/1910.01108
Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pp. 3–18. IEEE Computer Society (2017). https://doi.org/10.1109/SP.2017.41
Silvestri, F.: Mining query logs: turning search usage data into knowledge. Found. Trends Inf. Retr. 4(1-2), 1–174 (2010). https://doi.org/10.1561/1500000013
Sousa, S., Kern, R.: How to keep text private? A systematic review of deep learning methods for privacy-preserving natural language processing. Artif. Intell. Rev. 56(2), 1427–1492 (2023). https://doi.org/10.1007/S10462-022-10204-6
Truex, S., Liu, L., Gursoy, M.E., Wei, W., Yu, L.: Effects of differential privacy and data skewness on membership inference vulnerability. In: First IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications, TPS-ISA 2019, Los Angeles, CA, USA, December 12-14, 2019, pp. 82–91. IEEE (2019). https://doi.org/10.1109/TPS-ISA48467.2019.00019
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008 (2017)
Voorhees, E.M.: Overview of the TREC 2004 robust track. In: Voorhees, E.M., Buckland, L.P. (eds.) Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, Maryland, USA, November 16-19, 2004. NIST Special Publication, vol. 500-261. National Institute of Standards and Technology (NIST) (2004). http://trec.nist.gov/pubs/trec13/papers/ROBUST.OVERVIEW.pdf
Wagner, I., Eckhoff, D.: Technical privacy metrics: a systematic survey. ACM Comput. Surv. 51(3), 57:1–57:38 (2018). https://doi.org/10.1145/3168389
Xu, Z., Aggarwal, A., Feyisetan, O., Teissier, N.: A differentially private text perturbation method using regularized Mahalanobis metric. In: Proceedings of the Second Workshop on Privacy in NLP. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.privatenlp-1.2
Xu, Z., Aggarwal, A., Feyisetan, O., Teissier, N.: On a utilitarian approach to privacy preserving text generation. CoRR abs/2104.11838 (2021). https://doi.org/10.48550/ARXIV.2104.11838
Yue, X., Du, M., Wang, T., Li, Y., Sun, H., Chow, S.S.M.: Differential privacy for text analytics via natural text sanitization. In: Zong, C., Xia, F., Li, W., Navigli, R. (eds.) Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3853–3866. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.findings-acl.337, https://aclanthology.org/2021.findings-acl.337
Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: evaluating text generation with BERT. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net (2020). https://openreview.net/forum?id=SkeHuCVFDr
Zhao, Y., Chen, J.: A survey on differential privacy for unstructured data content. ACM Comput. Surv. 54(10s), 207:1–207:28 (2022). https://doi.org/10.1145/3490237
Zimmerman, S., Thorpe, A., Fox, C., Kruschwitz, U.: Investigating the interplay between searchers’ privacy concerns and their search behavior. In: Piwowarski, B., Chevalier, M., Gaussier, É., Maarek, Y., Nie, J., Scholer, F. (eds.) Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, pp. 953–956. ACM (2019). https://doi.org/10.1145/3331184.3331280
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
De Faveri, F.L., Faggioli, G., Ferro, N. (2025). Measuring Actual Privacy of Obfuscated Queries in Information Retrieval. In: Hauff, C., et al. Advances in Information Retrieval. ECIR 2025. Lecture Notes in Computer Science, vol 15572. Springer, Cham. https://doi.org/10.1007/978-3-031-88708-6_4
DOI: https://doi.org/10.1007/978-3-031-88708-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-88707-9
Online ISBN: 978-3-031-88708-6
eBook Packages: Computer Science, Computer Science (R0)