DOI: 10.1145/3351095.3375709 (ACM FAccT conference proceedings)
research-article

Towards fairer datasets: filtering and balancing the distribution of the people subtree in the ImageNet hierarchy

Published: 27 January 2020

ABSTRACT

Computer vision technology is being used by many but remains representative of only a few. People have reported misbehavior of computer vision models, including offensive prediction results and lower performance for underrepresented groups. Current computer vision models are typically developed using datasets consisting of manually annotated images or videos; the data and label distributions in these datasets are critical to the models' behavior. In this paper, we examine ImageNet, a large-scale ontology of images that has spurred the development of many modern computer vision methods. We consider three key factors within the person subtree of ImageNet that may lead to problematic behavior in downstream computer vision technology: (1) the stagnant concept vocabulary of WordNet, (2) the attempt at exhaustive illustration of all categories with images, and (3) the inequality of representation in the images within concepts. We seek to illuminate the root causes of these concerns and take the first steps to mitigate them constructively.
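Factor (3), inequality of representation within a concept, can be made concrete with a small sketch. Assuming each image in a category carries a demographic group label (a hypothetical annotation, not something this page specifies), the Python below counts images per group and then subsamples the category toward a chosen target distribution while keeping as many images as possible. It illustrates one simple balancing strategy only and should not be read as the procedure used in the paper; the function names `group_counts` and `rebalance` and the `(image_id, group)` record format are assumptions.

```python
import random
from collections import Counter, defaultdict


def group_counts(records):
    """Count images per group.

    `records` is a list of (image_id, group) pairs. The group labels are
    hypothetical annotations assumed to exist for every image.
    """
    return Counter(group for _, group in records)


def rebalance(records, target, seed=0):
    """Subsample `records` so group proportions approximate `target`.

    `target` maps each group to a proportion; the values should sum to 1.
    The total size N is the largest value with N * target[g] <= count[g]
    for every group, and each group contributes floor(N * target[g])
    images drawn at random without replacement.
    """
    by_group = defaultdict(list)
    for image_id, group in records:
        by_group[group].append(image_id)

    # The scarcest group relative to its target share limits the total size.
    n_total = min(len(by_group[g]) / p for g, p in target.items() if p > 0)

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = []
    for group, p in target.items():
        k = int(n_total * p)  # floor, since the value is non-negative
        chosen = rng.sample(by_group[group], k)
        kept.extend((image_id, group) for image_id in chosen)
    return kept
```

For example, a category with 80 images labeled `group_a` and 20 labeled `group_b`, rebalanced toward a 50/50 target, keeps 20 images from each group. Subsampling only removes images, so balance is bought at the cost of dataset size; collecting additional images for under-represented groups would be the alternative when that cost is too high.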
