Abstract
We investigate the impact of popularity bias on false-positive metrics in the offline evaluation of recommender systems. Unlike their true-positive counterparts, false-positive metrics reward systems that minimize recommendations disliked by users. Our analysis is, to the best of our knowledge, the first to show that false-positive metrics tend to penalize popular items, the opposite behavior of true-positive metrics, which causes the two types of metric to disagree in the presence of popularity biases. We present a theoretical analysis that identifies why the metrics disagree and determines the rare situations in which they might agree: the key lies in the relationship between the popularity and relevance distributions, specifically in their agreement and steepness, two fundamental concepts that we formalize. We then examine three well-known datasets, applying multiple popular true- and false-positive metrics to 16 recommendation algorithms. The datasets are chosen to allow us to estimate both biased and unbiased metric values. The results of the empirical study confirm and illustrate our analytical findings. With the conditions for disagreement between the two types of metrics established, we determine the circumstances under which researchers performing offline evaluation of recommender systems should use true-positive or false-positive metrics.
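The contrast between the two metric families can be illustrated with a small sketch (a toy example of our own, not taken from the paper): a true-positive metric such as precision@k counts recommended items the user likes, while a false-positive counterpart such as anti-precision@k counts recommended items the user dislikes, so lower is better. Because a popular item tends to accumulate both positive and negative feedback, a popularity-oriented ranking can score well on the former while also scoring badly on the latter.

```python
# Toy illustration of true- vs. false-positive metrics (hypothetical data).

def precision_at_k(recommended, liked, k):
    """True-positive metric: fraction of the top-k items the user likes."""
    return sum(item in liked for item in recommended[:k]) / k

def anti_precision_at_k(recommended, disliked, k):
    """False-positive metric: fraction of the top-k items the user dislikes
    (lower is better)."""
    return sum(item in disliked for item in recommended[:k]) / k

# Popular items attract both likes and dislikes; niche items mostly go unrated.
liked = {"popular_hit", "niche_gem"}
disliked = {"popular_flop"}
popularity_ranking = ["popular_hit", "popular_flop", "niche_gem"]

print(precision_at_k(popularity_ranking, liked, 2))          # 0.5
print(anti_precision_at_k(popularity_ranking, disliked, 2))  # 0.5
```

Here the popularity-based ranking earns a respectable precision@2 of 0.5 yet also incurs an anti-precision@2 of 0.5: the same bias toward popular items that helps the true-positive metric hurts the false-positive one, which is the disagreement the paper studies.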
Index Terms
- Popularity Bias in False-positive Metrics for Recommender Systems Evaluation