DOI: 10.1145/3544548.3581001

"Help Me Help the AI": Understanding How Explainability Can Support Human-AI Interaction

Published: 19 April 2023

Abstract

Despite the proliferation of explainable AI (XAI) methods, little is understood about end-users’ explainability needs and behaviors around XAI explanations. To address this gap and contribute to understanding how explainability can support human-AI interaction, we conducted a mixed-methods study with 20 end-users of a real-world AI application, the Merlin bird identification app, and inquired about their XAI needs, uses, and perceptions. We found that participants desire practically useful information that can improve their collaboration with the AI, more so than technical system details. Relatedly, participants intended to use XAI explanations for various purposes beyond understanding the AI’s outputs: calibrating trust, improving their task skills, changing their behavior to supply better inputs to the AI, and giving constructive feedback to developers. Finally, among existing XAI approaches, participants preferred part-based explanations that resemble human reasoning and explanations. We discuss the implications of our findings and provide recommendations for future XAI design.


1 INTRODUCTION

Artificial Intelligence (AI) systems are ubiquitous: from unlocking our phones with face identification, to reducing traffic accidents with autonomous cars, to assisting radiologists with medical image analysis. Being able to better understand these AI systems is becoming increasingly important—although what exactly that means is different in different settings: a smartphone user may want to understand how best to position their face to quickly unlock their phone, a researcher may want to understand what particular design decisions led to an autonomous car accident, and a radiologist may want to understand where the medical decision support tool is looking in suggesting a particular diagnosis.

Over the past years, numerous explainable AI (XAI) methods have been developed to provide transparency into these AI systems and make them more understandable to people (see [2, 7, 27, 41, 46, 48, 49, 109] for surveys). However, arguably these are being developed without embracing the full spectrum of end-user needs. Particularly for computer vision AI systems (such as the ones described above), with millions of model parameters processing thousands of low-level image pixels, translating model outputs into understandable insights is so challenging that proposed XAI methods are frequently limited by what XAI researchers can do rather than what AI end-users might need.

In this work, we connect XAI development with end-users and study a real-world context in which XAI methods might be deployed. Concretely, we set out to answer three research questions:

RQ1: What are end-users’ XAI needs in real-world AI applications?

RQ2: How do end-users intend to use XAI explanations?

RQ3: How are existing XAI approaches perceived by end-users?

In scoping our study, we focus on Merlin, an AI-based mobile phone application that uses computer vision to identify birds in user-uploaded photos and audio recordings. We chose Merlin because it is a widely-used application that allows us to connect with a diverse set of active end-users. Concretely, we conducted a mixed-methods study with 20 Merlin users who span the range from low-to-high AI background (representing both consumers and creators of AI systems) and low-to-high domain background (representing both users who know less and more about birding than the AI system).

With each participant, we conducted an hour-long interview, which included a survey and an interactive feedback session, to understand their XAI needs, uses, and perceptions. Our study bridges the gap between XAI research done in the HCI and AI communities by directly connecting end-users of a real-world AI application with the XAI methods literature. We do so by mocking up four XAI approaches that could be potentially implemented into Merlin, i.e., heatmap, example, concept, and prototype-based explanations of the AI’s outputs. The mock-up explanations enabled us to get concrete and detailed data about how participants intended to use XAI explanations, as well as how they perceived each approach, in an actual AI use context.

Through our study, we found:

Participants’ XAI needs varied depending on their domain/AI background and interest level. While participants were generally curious about AI system details, those with high-AI background or notably high interest in birds had higher XAI needs. However, participants unanimously expressed a need for practically useful information that can improve their collaboration with the AI, suggesting an important area of focus for future XAI development (RQ1, Sec. 5.1).

Participants intended to use XAI explanations for various purposes beyond understanding the AI’s outputs: determining when to trust the AI, learning to perform the task better on their own without needing to consult the AI, changing their behavior to supply better inputs to the AI, and giving constructive feedback to the developers to improve the AI. This highlights the broad range of XAI needs that should be considered in XAI development (RQ2, Sec. 5.2).

Among existing XAI approaches, participants preferred part-based explanations, i.e., concept [105, 144] and prototype [24, 88] based explanations. Participants found them similar to human reasoning and explanations, and the most useful for the aforementioned purposes. This suggests that to the extent possible, the XAI community should pay particular attention to these methods, despite the challenges with their development and evaluation (RQ3, Sec. 5.3).

Following our findings, we discuss XAI’s potential as a medium for enhancing human-AI collaboration, and conclude with a set of recommendations for future XAI design. However, as with any case study, our findings and recommendations may have limited generalizability. This is an intentional trade-off made to gain an in-depth understanding of end-users’ XAI needs, uses, and perceptions in a real-world context, in line with growing calls for human-centered XAI research [34, 35, 36, 67, 68]. We are hopeful that our study design and insights will aid future XAI research in other contexts.


2 RELATED WORK

2.1 From algorithm-centered to human-centered XAI

With the growing adoption of AI, there has been a surge of interest in explainable AI (XAI) research that aims to make AI systems more understandable to people. XAI is one of the fastest growing fields with hundreds of new papers published each year. See [1, 2, 7, 27, 29, 41, 46, 48, 49, 84, 107, 109, 110, 121] for in-depth surveys, and the following for examples of XAI research done in different disciplines: AI [42, 56, 59, 97, 118], HCI [50, 120, 133, 139], social and cognitive science [18, 26, 78, 80, 122, 124], and philosophy [14, 54, 82]. XAI is also increasingly being researched and applied in various domains, including but not limited to healthcare [5, 71, 75, 100, 117, 119, 135, 141], autonomous driving [9, 77, 94], energy and power systems [72], and climate science [73].

Much of the field’s efforts originally focused on the algorithms, i.e., on providing explanations of AI systems’ inner workings and outputs, rather than the people or the context where these systems are deployed. Recently, there has been a growing recognition that XAI methods cannot be developed “in a vacuum” without an understanding of people’s needs in specific contexts [34, 35, 36, 66, 67, 68]. In response, researchers have proposed conceptual frameworks to characterize XAI needs based on people’s roles [64, 102, 126], expertise [83], or more fine-grained axes of knowledge and objectives [123]. Others interviewed industry practitioners who work on AI products to identify their common XAI needs [15, 52, 66].

We join this relatively new line of research, called “human-centered XAI” [34, 35, 36, 66, 67, 68], and foreground the people who use AI systems and their needs, goals, and contexts in understanding how explainability can support human-AI interaction. In doing so, we build on the aforementioned frameworks to study end-users’ explainability needs. Concretely, we developed a survey based on Liao and colleagues’ XAI Question Bank [66] to collect concrete data on which aspects of AI end-users want to know about.

2.2 Understanding end-users’ XAI needs

Although human-centered XAI is an actively growing area of research, much of the work still focuses on developers rather than end-users of AI systems [15, 52, 66]. This gap is unsurprising, since XAI methods have been primarily developed for and used by developers to inspect AI systems [15, 80]. But it is critical because end-users may have different explainability needs that XAI methods should but don’t yet support.

Recently, some researchers began looking at end-users’ XAI needs in the context of specific applications [22, 23, 127]. Tonekaboni and colleagues [127] placed clinicians in hypothetical scenarios where AI models are used for health risk assessment, and found that clinicians wanted to know what features the model uses so they can understand and rationalize the model’s outputs. In a lab setting, Cai and colleagues [23] studied clinicians’ needs in their interaction with a prototype AI model that can assist with cancer diagnoses, and found that clinicians desired overall information about the model (e.g., capabilities and limitations, design objective) in addition to explanations of the model’s individual outputs. In another lab setting, Cai and colleagues [22] examined what needs pathologists have when using a prototype AI model for retrieving similar medical images. They also studied how pathologists use their proposed refinement tools, finding that pathologists often re-purposed them to test and understand the underlying search algorithm and to disambiguate AI errors from their own errors.

These studies delivered rich insights. However, they studied hypothetical or prototype AI applications. Hence, an important question remains, which we tackle in this work: What are end-users’ XAI needs in real-world AI applications? (RQ1). Elish and Watkins [37] recently provided insights into this question through an in-situ study of a deployed, real-world AI system. Specifically, they documented the types of inquiries physicians asked of nurses tasked with monitoring Sepsis Watch [113], an AI system designed to predict patients’ risk of sepsis development. However, they did not study how XAI methods could answer the physicians’ inquiries. In this paper, we take a step further and contribute to understanding how XAI methods can satisfy (or not satisfy) end-users’ needs by studying: How do end-users intend to use XAI explanations? (RQ2) and How are existing XAI approaches perceived by end-users? (RQ3).

Our work extends prior work in three more ways. First, while all aforementioned work [22, 23, 37, 127] studies AI applications that make or support high-stakes medical decisions, we focus on an ordinary application that a diverse set of people use in everyday life. Second, while prior work does not differentiate its participants, we study group differences with respect to domain and AI background levels. We are inspired by recent findings of Ehsan and colleagues [33] on how people’s perceptions of XAI explanations differed based on their AI background. Third, we connect to the XAI methods literature directly, by mocking up XAI explanations in the studied application. These in-situ mock-up explanations allowed us to gather detailed data on how end-users perceive and intend to use XAI explanations in their actual use of the AI.

2.3 XAI’s role in human-AI collaboration

Our work also connects to the literature of human-AI collaboration [6, 8, 23, 60, 62, 132], sometimes called human-AI teaming [10, 11, 95] or human-AI partnership [89], that studies how people work together with AI to achieve shared goals. In this work, we didn’t set out to study human-AI collaboration. Our use of this term emerged from our findings: while studying participants’ XAI needs, uses, and perceptions, we found that participants described a process for which the language of “collaboration” proved the best fit. Participants described a two-way exchange, where they help Merlin succeed in bird identification and obtain more accurate results in return, and expressed a strong desire to improve their collaboration with XAI explanations and other information. Hence, we give a brief overview of the human-AI collaboration literature and describe how our work connects to existing work.

Prior work has studied how people collaborate with different types of AI systems (e.g., robots [38, 61, 90, 91, 93, 130], virtual agents [8, 25, 31, 92], embedded systems [4, 23, 38, 40, 53, 57, 62, 63, 89, 91, 128]) in different task contexts (e.g., content generation [65, 70, 142], medical diagnosis [23, 40, 128], content moderation [53, 62], deception detection [63, 89], cooperative games [8], and fine-grained visual recognition [38, 57, 90, 91]). Among these, our work is most closely related to [38, 57, 63, 90, 91] that studied XAI’s role in AI-assisted decision making, where AI makes a recommendation and a human makes the final decision. In this work, we explored what role XAI explanations could play in Merlin where for each bird identification, end-users make the final call based on the app’s output and their knowledge of birds and the app.

However, different from our work, [38, 57, 63, 90, 91] focused on measuring the usefulness of specific XAI methods in AI-assisted decision making through lab experiments. These experiments typically consisted of simple tasks (e.g., binary choice) and were conducted with participants recruited from Amazon Mechanical Turk. Further, because they were lab experiments, it was well-defined in advance how participants should use XAI explanations in their collaboration with AI (e.g., look at the provided explanation and judge whether or not to accept the AI’s output). On the other hand, our qualitative descriptive study allowed us to find that participants intended to use XAI explanations for various purposes, highlighting a broad range of XAI needs and uses that should be considered in XAI development.

Figure 1: Screenshots of Merlin, our study application. Merlin is an AI-based bird identification mobile phone app. Users upload photos on the Photo ID feature (top) or sounds on the Sound ID feature (bottom) to get a list of birds that best match the input. Users also share optional location and season data. The resulting bird list comes with example photos and sounds.

2.4 XAI methods for computer vision

Finally, we review the XAI methods literature to provide background on how we mocked up XAI explanations for Merlin. We focus on methods developed for computer vision AI models because Merlin uses computer vision to identify birds in user-input photos and audio recordings. See [7, 19, 41, 46, 49, 107, 109] for more comprehensive overviews.

XAI methods can be categorized along several axes: first, whether a method is post-hoc or interpretable-by-design; second, whether it provides a global or local explanation; and third, by the explanation form. To begin, the majority of existing XAI methods are post-hoc methods that explain certain aspects of already-trained models [12, 13, 42, 44, 55, 58, 98, 105, 106, 112, 115, 116, 136, 138, 143, 144]. Recently, more interpretable-by-design methods are being proposed; these are typically new types of computer vision models with an explicitly-interpretable reasoning process [17, 20, 21, 24, 28, 30, 59, 88, 103]. Second, XAI methods provide either a local explanation of a model’s individual output or a global explanation of a model and its behavior. Local, post-hoc methods include feature attribution [42, 98, 112, 115, 116, 138, 143], approximation [106], and sample importance [58, 136] methods. Global, post-hoc methods include methods that generate class-level explanations [105, 144] and summaries of what a model has learned [12, 13, 44, 55]. Interpretable-by-design models can provide local and/or global explanations, depending on the model type. Lastly, explanations come in a variety of forms. Representative ones are heatmaps [17, 43, 98, 112, 115, 116, 134, 138, 143], examples [58, 136], concepts [59, 105, 144], and prototypes [24, 28, 88, 91]. To the best of our knowledge, these cover the range of XAI methods for computer vision.
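To make the heatmap form concrete, the following sketch illustrates how a Grad-CAM-style attribution map can be computed for a generic convolutional classifier; the choice of a torchvision ResNet-50 and of its last convolutional block are our own assumptions for illustration. This is a minimal sketch of the kind of feature attribution method cited above, not the implementation of any cited method and not how Merlin identifies birds:

import torch
import torch.nn.functional as F
from torchvision import models

# Stand-in classifier (hypothetical choice; requires a recent torchvision).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["feat"] = output.detach()

def save_gradient(module, grad_input, grad_output):
    gradients["feat"] = grad_output[0].detach()

# Hook the last convolutional block to capture its feature maps and gradients.
model.layer4.register_forward_hook(save_activation)
model.layer4.register_full_backward_hook(save_gradient)

def grad_cam(image):
    """image: a (1, 3, H, W) tensor, already resized and normalized."""
    logits = model(image)
    target = logits.argmax(dim=1).item()      # explain the top-scoring class
    model.zero_grad()
    logits[0, target].backward()
    # Weight each feature map by the spatial mean of its gradient, then combine.
    weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)               # (1, C, 1, 1)
    cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))   # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)                 # scale to [0, 1]
    return cam.squeeze(), target              # heatmap over the input image, predicted class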

Since we are not affiliated with the Merlin development team and do not have access to its AI models, it was not possible to produce actual explanations of how Merlin identifies birds. Hence, we created mock-up explanations. For comprehensiveness, we mocked up all four aforementioned explanation forms. We know they all are plausible XAI approaches for Merlin because they have been demonstrated on bird image classification models in prior work (e.g., heatmaps in [57, 96, 134], examples in [91], concepts in [59, 104, 105], prototypes in [24, 28, 88, 91]). See Fig. 2 and Sec. 4.2 for the mock-ups and their descriptions, and the supplementary material for details about how we created the mock-ups.


3 STUDY APPLICATION: MERLIN BIRD IDENTIFICATION APP

As described in Sec. 2, we looked for a research setting that involves real-world AI use by end-users with a diverse domain and AI knowledge base, and that people use in ordinary, everyday life scenarios. Furthermore, we looked for a domain with significant AI and XAI research. We found Merlin [125] fit what we were looking for. Merlin is a mobile phone app (Fig. 1) with over a million downloads that end-users, with diverse birding and AI knowledge, use for bird identification as they go out and about outdoors. Most birding apps are digital field guides that don’t use AI (e.g., Audubon Bird Guide [87], iBird Pro Guide [81], Birdadvisor 360° [99]). Merlin is unique in that it uses computer vision AI models to identify birds in user-input photos and audio recordings.

Merlin provided a grounded context with real end-users whose experience we can augment with mock-ups of XAI explanations. Furthermore, a large proportion of XAI methods for computer vision have been developed and evaluated on bird image classification [24, 28, 30, 47, 59, 88, 91, 96, 103, 129] using the Caltech-UCSD Birds (CUB) dataset [131]. Hence, the feedback we collect on the mock-up explanations for Merlin can provide concrete and immediate insights to XAI researchers.


4 METHODS

In this section, we describe our study methods, all of which were reviewed and approved by our Institutional Review Board prior to conducting the study.

Table 1:
              | Low-AI       | Medium-AI   | High-AI
Low-domain    | P7, P12, P16 | P8, P14     | P11, P13
Medium-domain | P2, P20      | P1, P4, P10 | P6
High-domain   | P5, P17      | P3, P9, P15 | P18, P19

Table 1: Participants’ domain (bird) and AI background. See Sec. 4.1 for a description of the background levels.

4.1 Participant recruitment and selection

We recruited participants who are end-users of Merlin’s Photo ID and/or Sound ID, the app’s AI-based bird identification features, with considerations for diversity in participants’ domain and AI background. Concretely, we created a screening survey with questions about the respondent’s domain background, AI background, and app usage pattern (e.g., regularly used app features, frequency of app use). We posted the survey on a variety of channels: Birding International Discord, AI for Conservation Slack, various Slack workspaces within our institution, and Twitter. On Twitter, in addition to posting the survey, we reached out to accounts with tweets about Merlin via @mentions and Direct Messages.

Based on the screening survey responses, we selectively enrolled participants to maximize the diversity of domain and AI background of the study sample. See Tab. 1 for a summary of participants’ background. The subgroups were defined based on participants’ survey responses and interview answers. We refer to individual participants by identifier P#.

Low-domain: From “don’t know anything about birds” (P11, P12) to “recently started birding” (P7, P8, P13, P14, P16). Participants who selected the latter option typically have been birding for a few months or more than a year but in an on-and-off way, and were able to identify some local birds.

Medium-domain: Have been birding for a few years and/or can identify most local birds (P1, P2, P4, P6, P10, P20).

High-domain: Have been birding for more than a few years and/or do bird-related work (e.g., ornithologist) (P3, P5, P9, P15, P17, P18, P19).

Low-AI: From “don’t know anything about AI” (P16, P17) to “have heard about a few AI concepts or applications” (P2, P5, P7, P12, P20). Participants in this group either didn’t know that Merlin uses AI (P12, P16) or knew but weren’t familiar with the technical aspects of AI (P2, P5, P7, P17, P20).

Medium-AI: From “know the basics of AI and can hold a short conversation about it” (P1, P3, P8, P9, P14) to “have taken a course in AI or have experience working with an AI system” (P4, P10, P15). Participants in this group had a general idea of how Merlin’s AI might work, e.g., it is neural network based and has learned to identify birds based on large amounts of labeled examples.

High-AI: Use, study, or work with AI in day-to-day life (P6, P11, P13, P18, P19). Participants in this group were extremely familiar with AI in general and had detailed ideas of how Merlin’s AI might work at the level of specific data processing techniques, model architectures, and training algorithms.

Note that references here and elsewhere to “high-AI background” participants describe their expertise with AI in general, not necessarily with Merlin’s AI. All participants were active Merlin users who could provide vivid anecdotes of when the app worked well and when it did not. Regarding frequency of use, 11 participants used it several times a week, 8 used it once a week, and one used it once a month.

4.2 Study instrument

Our interviews were structured in three parts and included a short survey and an interactive feedback session. The complete study instrument is attached in the supplementary material.

Context

First, we asked the participant a series of questions aimed at learning the context of their app use. These include questions about their background; when, where, why, and how they use the app; stakes in their use; and their knowledge and perception of AI.

XAI needs

Next, we inquired about the participant’s explainability needs through open-ended questions and a survey we developed from the XAI Question Bank [66]. The survey lists 10 categories of questions that people might have about an AI system. Nine categories (Data, Output, Performance, How, Why, Why not, What if, How to be that, How to still be this) are from [66], and we added a new Transparency category on expert and social transparency [32]. The survey asks the participant to select questions they “know the answer to” and/or are “curious to know (more).” We directed the participant to interpret “know the answer to” as “have a good idea of the answer” for questions whose exact answers are not available to end-users (e.g., What is the size of the data?).

XAI uses and perceptions

Finally, we assessed the participant’s perception of existing XAI approaches. Using screen sharing during the Zoom interview, we showed three examples of Merlin Photo ID identifications: the first is a correct identification; the second is a misidentification, one that people—even experienced birders—would make; and the third is a misidentification, but one that people wouldn’t make. Using these as running examples, we introduced four XAI approaches one at a time and in random order (see the next paragraph for more information). For each, we asked the participant what they like and dislike about the approach, what they think can be improved, whether they want to see it in the app, and how much it helps them understand the AI’s reasoning and output. See Fig. 2 for the identification examples and XAI explanations shown to participants.

To get detailed answers from participants about XAI uses and perceptions, we created mock-ups of representative XAI approaches that could potentially be embedded into Merlin. These included:

Heatmap-based explanations that highlight regions in the input image that are important for the AI model’s output. They represent feature attribution methods that visualize results via heatmaps (also known as saliency maps) [17, 43, 98, 112, 115, 116, 138, 143] and include popular techniques like Grad-CAM [112].

Example-based explanations that show examples in the training data that are important for the AI model’s output. They include methods that use influence functions [58] and representer points [136] to identify important positive/negative training examples for a particular output.

Concept-based explanations that explain the AI model’s output with text-based concepts. They include concept bottleneck models [59], as well as methods like IBD [144] and ELUDE [105] that generate class-level explanations as a linear combination of concepts.

Prototype-based explanations that explain the AI model’s output with visual prototypical parts. They represent methods such as ProtoPNet [24], ProtoTree [88], and their recent variations [28, 91].

As described in Sec. 2.4, these cover the range of XAI methods for computer vision. In the supplementary material, we describe in detail how we created the mock-up explanations and show examples of explanations beyond bird image classification. We emphasize that the three identification examples are real app outputs that we obtained by uploading photos to Merlin Photo ID. However, the shown XAI explanations are mock-ups that we designed; they are not representative of how Merlin Photo ID actually identifies birds. We made this point clear to participants during the interviews. We also communicated that we were not affiliated with Merlin’s AI development team.
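To make the concept-based form concrete, the following sketch shows how a class score can be presented as a linear combination of concept contributions, in the spirit of our mock-ups. The concept names, coefficients, and presence values below are hypothetical and are not produced by Merlin or by any cited method:

from dataclasses import dataclass

@dataclass
class ConceptEvidence:
    name: str           # human-readable concept, e.g. "long beak"
    coefficient: float  # weight of the concept for the predicted class
    presence: float     # detected presence of the concept in the input, in [0, 1]

    @property
    def contribution(self) -> float:
        return self.coefficient * self.presence

def render_explanation(concepts):
    # Sort by contribution so the strongest evidence is listed first.
    for c in sorted(concepts, key=lambda c: c.contribution, reverse=True):
        kind = "positive" if c.contribution >= 0 else "negative"
        print(f"{c.name:<16} {kind} evidence  ({c.contribution:+.2f})")
    print(f"{'final score':<16} {sum(c.contribution for c in concepts):+.2f}")

# Hypothetical values, for illustration only.
render_explanation([
    ConceptEvidence("long beak",       1.7, 0.9),
    ConceptEvidence("black feathers",  0.8, 0.6),
    ConceptEvidence("white body",     -1.2, 0.2),
])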

Figure 2: Identification examples and XAI explanations shown to participants. We showed three real examples of Merlin Photo ID identifications, paired with mock-up explanations we designed for each identification. Each XAI approach was introduced with the following description. Heatmap: The AI model considers the red highlighted regions as evidence for [output, i.e., Evening Grosbeak, House Wren, Ruby-throated Hummingbird]. Example: The AI model considers the input photo most similar to the below photos of [output] the AI model has seen. Concept: The AI model considers the presence of the following concepts as positive/negative evidence for [output]. Prototype: The AI model considers the boxed regions of the input photo similar to parts of [output] photos the AI model has seen. See Sec. 4.2 for details.

4.3 Conducting and analyzing interviews

We interviewed 20 participants, each over a Zoom video call, from July to August 2022. The interviews lasted one hour on average. Participants were compensated with their choice of a 30 USD gift card or a donation to a bird conservation organization made on their behalf.

We transcribed the interviews and then analyzed the transcripts. First, two authors read through five transcripts to develop an initial shared codebook, primarily using descriptive coding, i.e., describing participants’ expressions of their beliefs, perceptions, and actions. Descriptions were at the semantic level within the surface meanings of the data (e.g., desire to learn from the AI to improve their task skills). Next, during multiple group meetings, all authors iterated on and refined the codebook, by shifting from descriptions of participants’ beliefs to identification of shared latent, conceptual themes [108]. Desire for improved human-AI collaboration is an example of a latent, conceptual theme we interpreted based on participants’ expressed needs for information that would help them understand the AI’s capabilities and limitations, identify sources of errors, and supply better inputs, in order to more effectively interact with the AI and achieve better outcomes. After we collectively agreed that our conceptual themes were exhaustive, we then identified and eliminated themes which were redundant or overlapping. Once we had a final revised codebook, one author then used this codebook to re-code all of the data. Example codes include: desire to know the AI’s confidence in its output (XAI needs), learn how to take better pictures/audio recordings for the AI (XAI uses), and heatmap-based explanations are too coarse (XAI perceptions).

We deliberately did not calculate inter-rater reliability (IRR) as part of our analytic process. McDonald and colleagues [76] argue that such a calculative process is a poor fit for the interpretive paradigm from which qualitative research has developed. Our codebook is derived from our collective and agreed-upon interpretations of our participants’ responses to our questions, and so a mathematical post-hoc comparison of individuals’ coding selections would bring little rigor to the process. Instead, we focused on bringing rigor to our analysis through the discussions and selections involved in the codebook development.


5 RESULTS

We present our results in three parts. We begin by describing participants’ explainability needs (RQ1, Sec. 5.1). We then describe how they intended to use XAI explanations (RQ2, Sec. 5.2). Finally, we describe how they perceived the four representative XAI approaches we mocked-up for Merlin (RQ3, Sec. 5.3).

5.1 XAI needs: Participants desired more information about AI, especially practically useful information that can improve their collaboration with AI

Based on open-ended questions and the survey we developed from the XAI Question Bank [66], we found that while participants were generally curious about AI system details, only those with high-AI background and/or high-domain interest were willing to actively seek out this information (Sec. 5.1.1). However, participants unanimously expressed a need for information that can improve their collaboration with the AI system (Sec. 5.1.2).

5.1.1 Participants were generally curious about AI system details, but curiosity levels differed based on AI background and domain interest.

Like most other AI applications, Merlin does not provide much information about its underlying technology. Hence, when we asked participants what they knew about the app’s AI, all replied that they didn’t know much about system details, although those with high-AI background (P6, P11, P13, P18, P19) had detailed guesses about the app’s data, model architectures, and training algorithms.

So what did participants want to know? According to the survey results, participants wanted to know everything about the app’s AI. For all questions in the survey, most if not all participants responded they “know (or have a good idea of) the answer” and/or are “curious to know (more).” That is, participants were curious about overall system details (questions in the Data, Output, Performance, How, Transparency categories), as well as how the AI reasons and makes judgments on specific inputs (questions in the Why, Why not, What if, How to be that, How to still be this categories). We report the full survey results in the supplementary material.

But how curious are they, really? When we tempered self-reported levels of curiosity with interview questions about the effort participants were willing to invest to satisfy that curiosity, the picture changed. “I wouldn’t go tremendously out of my way to find the answer to these questions” (P12) was a sentiment shared by many participants (P1, P5, P6, P7, P9, P10, P12, P13, P16, P20). For instance, P5 said: “If there’s an opportunity that arises, I’d love to ask about it [...] but I don’t think I would be contacting people at Cornell [app developers].” Other participants were open to searching around a bit (P9, P10), listening to talks or podcasts (P12), or reading some documentation if easily available (P1, P6, P7, P13, P16, P20), but didn’t want to take the initiative to seek out more information about the AI system, as described by the questions in the survey.

Exceptions were some participants with high-AI background (P11, P18, P19) or notably high interest in birds (P1, P4, P8). P11, P18, and P19, likely because they develop AI systems in their work, were very curious about the app’s AI and were willing to go to the extent of reaching out to the app developers (P11, P18) or playing with the data themselves (P19). For example, P19 said: “I’d love to talk to one of the engineers and pick their brain [...] or get some data and play with it myself.” P1, P4, P8 have medium-AI background, but their exceptionally high interest in birds seemed to fuel their curiosity about the app’s AI. They were particularly curious about how the AI tackles difficult identifications such as mockingbirds that mimic other birds or birds that are difficult for experienced human birders to identify (e.g., “little brown birds”).

In contrast, participants with low-to-medium AI background (P7, P8, P9, P10, P12, P16) had lower explainability needs. For instance, P7, P8, and P10 had little-to-no interest in how the AI reasons and makes judgments on specific inputs. P8 said questions in the Why, Why not, What if, How to be that, How to still be this categories were not what they would ever think about on their own. P7 expressed more bluntly that they prefer to keep the AI as a black box: “No, I don’t want to ruin the mystique.” P9, P12, and P16, on the other hand, became more curious during the interview; however, their responses suggest that they were not very curious about the AI in their natural use environment prior to the interview.

In short, all participants were interested in learning more about the AI, but only those with high-AI background and/or high-domain interest were willing to expend effort to gain more information about the AI’s system details.

5.1.2 Participants desired information that can improve their collaboration with AI.

Participants’ expressed needs for explanation shifted, however, when our interview questions moved away from gauging their curiosity about AI system details, and towards querying their use of the app. While participants’ needs for system details differed based on background and interest, they unanimously expressed a need for practically useful information that could improve their collaboration with the AI system.

To begin, participants wanted a general understanding of the AI’s capabilities and limitations (P1, P4, P5, P16, P19, P20). P1 described a number of ways this understanding would help their use of the app: “It would definitely first help me understand more about when certain identifications may be more or less reliable. But also it will help me supply better inputs to the app to try and get the best quality identification results that I can” (P1). Participants had already tried to gain this understanding by pushing the AI to its limits (P4, P5, P16, P19, P20). Some had tried to fool the AI with non-bird sounds (e.g., sounds of other animals, bird impersonations) to understand when it works and when it breaks (P4, P5, P16, P19). Others had conducted more rigorous experimentation by altering their input (e.g., clip the audio recording, remove location information) and observing changes in the AI’s output to understand what factors influence the AI’s output and how (P4, P20).

Another frequently expressed need was for a display of the AI’s confidence (P1, P2, P3, P4, P6, P13, P18, P20). Participants wanted this information to better determine when to trust the AI’s output. Concretely, P2 asked for percentage-based confidence scores: “If it doesn’t give a percentage [...] I just don’t have a gauge of how correct it is” (P2). P7 requested that the AI qualify its output by saying “it may not be the exact match” or give a general answer (e.g., “we don’t know the exact species but this bird is in the Wren family”).

Lastly, participants wanted the AI to give more detailed outputs (P2, P10, P11, P12). They demanded information that would help them verify the AI’s output. For instance, P10 wanted the AI to “highlight the time period of the [sound] clip that it calls a certain species” because it is hard to know which sound corresponds to which bird when multiple birds are singing at once. Going a step further, P2, P11, and P12 wanted the AI to specify the type of bird sound it heard. Currently, the verification process is arduous because each bird species has a number of songs and calls, as well as more specific sounds such as juvenile calls, flock calls, and alarm calls. They said the suggested features would make the verification process easier and provide more information about how the AI has made its identification, with which they can more readily check the AI’s output and determine whether to trust it.

In sum, when we queried participants about their actual, real-world use of the app, they expressed a desire for information which could improve their use of the app, particularly in deciding whether or not to trust the AI’s outputs. Intriguingly, they expressed these desires before we showed them our mock-ups of what XAI explanations for the app might look like. This suggests that these XAI needs were not prompted solely by seeing XAI explanations.

5.2 XAI uses: Participants intended to use explanations for calibrating trust, improving their own task skills, collaborating more effectively with AI, and giving constructive feedback to developers

Next, when we showed XAI explanations to participants, they were excited to use them for various purposes beyond understanding the AI’s outputs: for determining when to trust the AI (Sec. 5.2.1), which is a well-known use and commonly-stated motivation for XAI [39, 79, 111, 137, 140], but also for learning to perform the task better on their own (Sec. 5.2.2), changing their behavior to supply better inputs to the AI (Sec. 5.2.3), and giving feedback to the developers to improve the AI (Sec. 5.2.4), which are less discussed uses in existing literature.

5.2.1 Participants intended to use explanations to determine when to trust AI.

Many participants said they would use explanations to determine when to believe the app’s identification result (P1, P4, P8, P11, P13, P18, P20). The need underlying this use is consistent with the aforementioned need for information that helps them decide when to trust the AI. While looking at different explanation mock-ups, participants gave examples of when their trust would increase and decrease. For instance, participants said they would feel more confident in the AI’s output when heatmap-based explanations show that the AI is “looking at the right things” (P8) and when example-based explanations show example photos that look similar to their input photo. Conversely, they said they would feel more skeptical when heatmap-based explanations suggest that an “artifact was important” (P8), when concept-based explanations have errors in their concept recognition (e.g., says there is a long beak when there is not) (P18), and when prototype-based explanations match photo regions and prototypes that don’t look similar to them (P4). These findings confirm existing literature [39, 79, 111, 137, 140] and suggest that trust calibration will be an important use of XAI.

5.2.2 Participants desired to learn via explanations to better achieve the task on their own.

Intriguingly, a greater number of participants said that they intend to use explanations to improve their task skills (P1, P2, P4, P6, P7, P8, P9, P10, P11, P13, P15, P17, P19, P20). Participants viewed the AI as a teacher and were keen to learn the features it looks at via explanations, so they can look for these features in the future when they are birding on their own. Participants were aware that the features the AI looks at may be different from what expert human birders look at. But they weren’t very concerned about the potential differences. One participant even said it would be interesting if the AI finds new ways of identifying birds and explanations can “call attention towards things that people did not really think of before” (P1). Still, participants preferred that explanation forms be similar to those of human birders. We elaborate on this point further in Sec. 5.3.

Overall, participants were excited about how explanations could make birding more accessible for themselves and others who lack access to expert resources (e.g., mentoring from human birders):

“It [the explanation] is kind of training or giving me more information and I’m kind of learning these things [what features to look at]. Whereas before, birders or ornithologists are learning this from mentors or teachers in the field. But those opportunities are limited based on social relations, privilege, how closely you are connected to birding groups and stuff. And so it will be much more openly accessible if that kind of more comparative identification knowledge was accessible through just an app.” – P1

Even participants with high-domain background, whose main goal for using the app was not to obtain such knowledge, appreciated the educational value of explanations and said explanations would help them learn faster (P16).

These findings are closely related to recent works by Goyal and colleagues [47] and Pazzani and colleagues [96]. They demonstrated that XAI explanations help non-bird-experts (graduate students in machine learning [47] and undergraduate students in psychology, cognitive science, or linguistics courses [96]) learn to distinguish birds. While their experiments employed relatively easy tasks, i.e., assigning bird images to one of two species options, they showed the potential of learning from AI via XAI explanations. While [47, 96] did not establish that this is a need that people have, our work provides empirical evidence for it, suggesting learning from AI as another important use case for XAI.

We postulate this use case stemmed from Merlin’s status as an expert AI system. Many AI applications are deployed to automate tasks that are easy for people (e.g., face verification, customer service chatbot) in settings where it is costly or implausible to have humans in the loop. In contrast, Merlin possesses expertise that most people don’t have and need to invest time and effort to gain. This expertise is likely the source of Merlin explanations’ educational value. In other types of AI applications, end-users may not intend to learn from AI via explanations.

5.2.3 Participants viewed explanations as an opportunity to be better AI-collaborators.

Participants also saw explanations as an opportunity for action. They looked for feedback on their own behavior that would in turn enable them to help the AI better achieve the task (P1, P7, P9, P20). P20 said explanations, by providing insights into how the AI got an identification wrong, can help them figure out the answer to: “What would I have to do to change this photo to make it [AI] understand it better?” Participants sought out opportunities to improve their own collaborative skills when working with the AI to achieve a task, because at the end of the day they want to achieve best possible outcomes:

“You’re still trying to look for the right bird. So if you can adjust human behavior to get the right answer out of the robot [AI], then that’s helpful.” – P20

Because of this need, participants were critical towards XAI approaches they thought didn’t provide actionable feedback. For instance, P9 questioned the utility of heatmap and example-based explanations: “How is it helpful to the user in the future? Besides just being cool and interesting? How does it change the user’s use of the app? Does it make you take a different photo?” They critiqued that these approaches don’t help them help the AI be more correct.

We view this intended use of XAI explanations as an extension of participants’ current efforts to help out the AI. When describing their use of the app, participants mentioned several different ways they help the AI perform better. Some were smaller adjustments on the spot, such as facing the microphone closer to the bird and getting a sufficiently long recording for Sound ID (P9). Others were more involved, such as the efforts P1 described as part of their “general workflow” for using Photo ID:

“I basically don’t use images that are either too blurry or do not feature the bird in an unobstructed manner. I know from my personal experience using it that Merlin works a lot better if it has a more silhouetted side profile shot of the bird. [...] So I try to feed Merlin photos taken from similar angles, also in acceptable lighting conditions. I might have to boost the contrast or the brightness of a picture artificially to feed it into Merlin to get better results. If there’s no real contrast, then it’s much harder to get credible results.” – P1

In short, participants viewed the AI as a collaborator. They have already found ways to better work with it, and they intended to use XAI explanations to further improve their collaboration. To this end, they wanted explanations to provide actionable feedback on their own behavior so that they can supply better inputs to the AI.

5.2.4 Participants saw explanations as a medium to give feedback to developers and improve AI.

Finally, participants with high-AI background intended to use explanations as a medium to give feedback to developers and contribute to improving the AI (P13, P18, P19). These participants mentioned that explanations, by providing more information to end-users about how the AI produced its output, enable end-users to give more detailed feedback. This feedback can then help developers improve the AI system. P13 illustrated this process using prototype-based explanations as an example:

“The fact that it [AI] identifies parts of the tree, that’s a great opportunity to [have end-users] tap that region and say ‘not a part of the bird’ so that you can get the users helping you to do some curation and labeling on the images, which someone could review or whatever. You can make much higher quality models by getting this sort of the labeling right.” – P13

P18 suggested a similar feedback process for example-based explanations. They said when end-users disagree with the provided examples of similar looking birds, they can correct them by saying “no, I think it actually looks more like bird number three” and help developers align the AI’s notion of perceptual similarity with that of humans, and improve the AI.

Lastly, P19 described XAI’s potential for creating a positive feedback loop that helps both end-users and the AI system:

“So there’s a feedback loop here, right? Because if that [teaching people to better identify birds] is your goal, and you’re successful in doing that, then you’re able to rely on people to verify their data, contribute solid data, and that data can help inform Merlin, which makes Merlin better, which makes it do its job better. [...] I think no matter what, it [providing explanations] is kind of beneficial.” – P19

P13 and P18 shared this view and said they would be excited to help developers improve the app by providing feedback via explanations. P18, in particular, expressed a strong desire to contribute. They had already been signing up for beta versions of the app, and the first answer they gave to the question “What would you like to know more about Merlin?” was: “How I can contribute more” (P18).

In short, participants with high-AI background desired to use explanations to help improve the AI, so that they can achieve better outcomes with it in the future. We interpret this as another example of participants viewing the AI as a collaborator whom they work together with.

5.3 XAI perceptions: Participants preferred part-based explanations that resemble human reasoning and explanations

In this last results section, we describe how participants perceived the four XAI approaches we mocked up: Heatmap (Sec. 5.3.1), Example (Sec. 5.3.2), Concept (Sec. 5.3.3), and Prototype (Sec. 5.3.4). We also summarize concerns expressed toward explanations (Sec. 5.3.5), and explore how existing XAI approaches might satisfy end-users’ explainability needs and goals identified in the previous sections.

5.3.1 Heatmap-based explanations: Most mixed opinions.

We received the most mixed reviews for heatmap-based explanations. Participants who liked heatmaps described them as “fun” (P15), “aesthetically pleasing” (P3), and intuitive—“it’s very easy, it hits you right away” (P9). Some participants were positive because they often use heatmaps in their work and find them helpful for representing information (P12, P19). Conversely, a few participants expressed a strong dislike (P14, P16), e.g., “I hate those things [...] They are simply not intuitive” (P14). P20 didn’t like heatmaps as an explanation form because “heatmaps feel like they should be related to weather,” revealing individual differences in perception.

Regarding utility, some said heatmaps help them understand how the AI had made a mistake (P7, P9, P13). For instance, P19 said they see how the AI made a mistake for the Marsh Wren photo because the heatmap (in Fig. 2) did not highlight areas that are important for distinguishing different species of Wrens (e.g., Marsh Wren has a white eyebrow that House Wren doesn’t). However, many participants criticized that the shown heatmaps were too coarse and uninformative (P1, P2, P3, P4, P6, P10, P11, P16, P17, P19). “It’s just highlighting the bird” was a common remark. Participants said heatmaps would be more helpful if they highlight a few salient features of the bird, just like how human birders focus on a few field markers when identifying birds.

Finally, some participants thought heatmap-based explanations were inherently limited by their form. P9, P11, and P17 said heatmaps were unsatisfying because they don’t answer the “why” question. Regarding heatmaps’ highlighted regions, P17 asked: “Yes it’s important, but why was it important?” Other participants were dissatisfied because heatmaps lacked actionable information (P9, P11). They said knowing which parts of the photo were important to the AI does not help them change their behavior to help the AI be more correct in future uses.

Table 2:
XAI approach | Praises                                     | Complaints
Heatmap      | Intuitive, pleasing                         | Unintuitive, confusing
             | Helpful for spotting AI’s mistakes          | Uninformative, too coarse
             |                                             | Doesn’t explain why certain parts are important
             |                                             | Doesn’t give actionable information
Example      | Intuitive                                   | Uninformative, impression-based
             | Helpful for verifying AI’s outputs          | Doesn’t add much to current examples
             | Allows end-users to do their own moderation | Doesn’t give actionable information
Concept      | Part-based form                             | Current concepts are too generic
             | Resembles human reasoning and explanations  | Meaning of coefficients is unclear
             | Helpful for verifying AI’s outputs          | Numbers are overwhelming
             | Helpful for learning bird identification    |
             | Final scores and coefficients are helpful   |
Prototype    | Part-based form                             | Cluttered, difficult to see on small screens
             | Resembles human reasoning and explanations  | Some prototypes are ambiguous and uninteresting
             | Intuitive, visual                           |
             | Helpful for verifying AI’s outputs          |
             | Helpful for learning bird identification    |

Table 2: Summary of participants’ feedback on four XAI approaches. See Sec. 5.3 for details.

5.3.2 Example-based explanations: Intuitive but uninformative.

There was a consensus among participants that example-based explanations are “really easily understandable.” However, opinions diverged regarding their utility. Some found them helpful for determining when to trust the AI (P4, P5, P17, P20) since they themselves can compare their input photo to the examples in the explanations. P4 noted that example-based explanations feel “much more collaborative” since they allow end-users to do their own moderation of the provided information. P19, on the other hand, was concerned that they would “open the door for user error.” Especially for difficult identifications where there are only subtle differences between candidate birds, P19 said example-based explanations wouldn’t help non-bird-expert end-users arrive at a more accurate identification.

Many participants described example-based explanations as rather uninformative (P1, P4, P6, P8, P10, P11, P12, P18). Some thought they didn’t add much information to example photos that are already shown in the app with the identification result (P1, P6, P10, P11). They understood the difference between the two, that example-based explanations convey what the AI considers similar to the input photo, while the currently provided example photos are part of a fixed bird description and independent of the input. Still, they thought the explanations were not very useful. Some even preferred the current example photos because they are high-quality and well-curated (P1, P6).

Another frequent criticism against example-based explanations was that they are too general and impression-based (P4, P8, P10, P12, P18). Participants were frustrated that they don’t communicate what features the AI was using to make its identifications, e.g., P8 said “This kind of tells you nothing.” Due to this lack of specificity, many mentioned that example-based explanations were not helpful for their various intended uses, ranging from understanding the AI’s reasoning to supplying better inputs to the AI to improving their own bird identification skills.

5.3.3 Concept-based explanations: Well-liked overall but overwhelming to some.

Participants were largely positive towards concept-based explanations. Most praises were about their part-based form. They liked that the AI’s output was broken down into chunks that human birders reason with, i.e., concepts (P3, P4, P11). “This is what a person looks for basically when they’re identifying a bird,” remarked P3. Relatedly, participants liked that concept-based explanations resemble the way bird identifications are taught and shared between birders (P3, P8, P17). P17 said, “before all this technology, this is exactly how you would basically learn to ID a bird.” For these reasons, participants mentioned that concept-based explanations seem helpful for learning to identify birds on their own.

Participants also mentioned other use cases where concept-based explanations can help. For instance, P11 said they would allow people to check the AI’s output more thoroughly because people can agree or disagree with the explanation at the level of individual concepts. As an example, they said they would not believe the AI’s output if the explanation says there are red feathers in the photo when there are not. Participants also liked that the shown explanations provided a final score for the output because they display the AI’s confidence in the identification (P1, P5, P17). P5 said such scores would be particularly helpful when they are comparing similar-looking candidate birds.

Nonetheless, participants mentioned a few areas of improvement. Several participants pointed out that the concepts in the shown explanations (e.g., long beak, black feathers, white body) were too general (P1, P4, P5, P10). They suggested adopting birders’ language and describing birds with more specific terms such as “underbelly, chest, rump, wing, wingbars, neck, head, cap” (P4). Participants also recommended making the numbers in the explanations as easily understandable as possible (P6, P9, P12, P13, P15, P16, P18). P6 pointed out that the current concept coefficients are confusing: “I have no idea what any of the numbers mean? Like is 1.7 good?” Specifying what are good and bad numbers and constraining the coefficients’ range may mitigate some of the confusion. Even with these changes, however, concept-based explanations may not be everyone’s cup of tea. Some participants shared that they find the explanation form inherently overwhelming and less attractive (P5, P13, P16, P20). P16 shared: “I sort of tune out with numbers after a while.” P20 also expressed their preference for more visual explanations: “I’m such a visual person that stuff like this would go right over my head and make no sense for the most part.”
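One possible way to realize this suggestion, sketched below under our own assumptions, is to rescale raw concept contributions into signed percentages of the total evidence, so the displayed numbers have a fixed, bounded meaning. The function and values shown are purely illustrative and are not part of Merlin or of any cited method:

def to_percent_contributions(contributions):
    """Rescale signed concept contributions so their absolute values sum to 100."""
    total = sum(abs(v) for v in contributions.values()) or 1.0
    return {name: round(100 * v / total, 1) for name, v in contributions.items()}

# Hypothetical raw contributions, for illustration only.
print(to_percent_contributions(
    {"long beak": 1.53, "black feathers": 0.48, "white body": -0.24}
))
# {'long beak': 68.0, 'black feathers': 21.3, 'white body': -10.7}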

5.3.4 Prototype-based explanations: Most preferred.

Many participants picked prototype-based explanations as their favorite (P2, P3, P4, P6, P7, P9, P10, P12, P13, P15, P16, P17, P19, P20). The part-based form was clearly preferred, for similar reasons mentioned for concept-based explanations. P15 and P20 said prototype-based explanations are analogous to how they think about birds, and P1 that they are analogous to how birders teach each other. Between prototypes and concepts, participants tended to prefer prototypes for their visual nature and information content: prototype-based explanations locate and draw a box around relevant bird parts in the user-input photo, whereas concept-based explanations only list the bird parts. P4 summarized the advantages: “It makes a very clear match between the photo that you’re looking at and a larger base of what this bird should look like. It also skips over the whole language issue and is incredibly visual which I really appreciate.” Participants also noted that prototype-based explanations can help many uses, e.g., learning how to identify new birds (P2, P8, P13, P15, P19, P20), understanding how the AI is working (P11, P13, P15, P16, P20), spotting the AI’s mistakes (P4, P13), and changing their own behavior to supply better inputs to the AI (P20).

Complaints about prototype-based explanations were mostly minor. Some participants described the current version as “cluttered” and “difficult to see” (P1, P4, P5, P6, P11) and made UI design recommendations, e.g., having one prototype-photo region match pop up at a time (P11). Participants also mentioned that some prototypes were ambiguous (P2, P11, P18). For instance, P11 said they had to “examine the prototype and the example to figure out what the concept was that they corresponded to.” As a solution, P2 suggested providing a textual description of the prototype. Another complaint was that some prototypes (e.g., feet) were uninteresting (P1, P13, P18). “Very few bird species are differentiated based on their feet,” remarked P1. To address this problem, participants suggested curating prototypes with domain experts and end-users so that explanations focus on salient and interesting features, i.e., those that human birders would use to identify birds.
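
For readers unfamiliar with how such explanations are produced, the following is a minimal sketch of how a prototype-based approach in the spirit of ProtoPNet [24] locates and boxes the image region that best matches a learned prototype. The feature map, prototype vector, and image geometry are random stand-ins for a trained model, not Merlin’s implementation.

```python
import numpy as np

# Minimal sketch of prototype matching (ProtoPNet-style [24]); values are random stand-ins.
feature_map = np.random.rand(7, 7, 128)   # H x W x D activations from a CNN backbone
prototype = np.random.rand(128)           # one learned prototype vector (D,)

# Distance of every spatial cell to the prototype, converted to a similarity score.
dists = np.linalg.norm(feature_map - prototype, axis=-1)
sims = np.log((dists**2 + 1.0) / (dists**2 + 1e-4))
row, col = np.unravel_index(np.argmax(sims), sims.shape)

# Map the best-matching cell back to a box in a 224 x 224 input photo,
# which is what the explanation would draw over the user's picture.
cell = 224 // 7
box = (col * cell, row * cell, (col + 1) * cell, (row + 1) * cell)  # (x0, y0, x1, y1)
print(f"best match at cell ({row}, {col}); draw box {box}; similarity {sims[row, col]:.2f}")
```

In this framing, P2’s suggestion of adding a textual description amounts to attaching a concept label to each prototype, and the curation participants proposed amounts to pruning the set of prototype vectors before matching.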

Finally, several participants suggested combining prototype-based explanations with other approaches (P2, P4, P11, P12, P16, P18, P19). Concretely, P2 suggested combining them with heatmap-based explanations; P2, P12, P16, and P18 with concept-based explanations; and P4 and P11 with example-based explanations. P19 did not specify an approach. Regarding the combination, some suggestions were general (e.g., show both types of explanations) while others were more specific (e.g., add concept labels to prototypes). P12 and P18 particularly advocated for using information from multiple sources (e.g., photo, sound, location) to both improve the AI’s performance and explain its results to end-users.

5.3.5 Concerns about XAI explanations.

Participants were overall excited to see XAI explanations in Merlin; however, some expressed concerns regarding the faithfulness and potential negative effects of explanations. In particular, participants who were familiar with XAI questioned how faithfully the shown approaches would explain the app’s identification process, if they were to be implemented in the app (P6, P10). For example, P6 said example-based explanations feel like “cheating interpretability” unless the AI actually makes identifications using clustering or other techniques that group similar photos together. Regarding concept-based explanations, P6 and P10 asked whether they imply that the AI system is interpretable-by-design and actually reasons in two steps (first concept recognition, then bird identification), or whether they are post-hoc explanations produced by a separate “explainer” system. These questions highlight the importance and challenges of communicating what XAI explanations are actually showing. In some cases, explanations of XAI explanations (“meta-explanations”) may be more complex than the XAI explanations themselves.

Another concern was that explanations might lead to mistrust or overtrust in AI systems. P20 said a convincing explanation for a misidentification would be “detrimental” to end-users who are trying to learn bird identification on their own, because they might more readily believe the misidentification and accumulate wrong knowledge. Similarly, P19 said explanations might encourage end-users to “double down on the incorrect identification,” and even create a negative feedback loop if the AI system relies on end-users to input or verify data. These concerns are consistent with findings from recent research [57, 101] that people tend to believe AI outputs when given explanations for them, and they urge caution about the negative effects explanations might have on end-users, irrespective of XAI designers’ intent.


6 DISCUSSION

6.1 XAI as a medium for improving human-AI collaboration

The most surprising finding from our study was the degree to which Merlin end-users wanted to improve their collaboration with the AI system through XAI explanations. Participants desired information upon which they can act. In particular, they wanted XAI explanations to help them help the AI, e.g., by supplying better inputs to the AI and providing constructive feedback to developers. We found this an intriguing re-purposing of explanations, which are typically developed to help people understand the AI’s inner workings and outputs. Collaboration is distinct from usability. Usability is often discussed in the XAI literature, where XAI is presented as a means to provide meaningful information about how an AI output is reached, so that users can “make decisions more quickly, or to increase decision quality” [64]. However, our participants desired information that not only improves their decisions based on the AI’s outputs (usability), but also empowers them to help the AI be more accurate in the first place and achieve better outcomes together (collaboration).

For designing XAI that supports human-AI collaboration, research in the accessibility field can be instructive. Bigham and colleagues’ work on the VizWiz system [16] combined computer vision with human labor to support blind end-users in solving daily visual problems. The VizWiz system allowed end-users to upload a photo of their environment and ask remote workers visual questions (e.g., where is the tomato soup can?). In their study, the authors found that input photos presented challenges to the VizWiz system because the blind end-users provided photos which were “often blurred, tilted, varied in scale, and improperly framed… and susceptible to problems related to lighting” [16]. To overcome these challenges, the authors developed a subsystem that uses computer vision to provide end-users with prompts around lighting, centering, and distance-to-object. These prompts helped end-users take photos that help remote workers better answer their questions. Just as VizWiz’s subsystem helped end-users collaborate with VizWiz and its remote workers, we argue that XAI explanations can and should serve as a medium for improving end-users’ collaboration with AI systems. In addition to providing technical transparency, XAI explanations should move towards providing actionable feedback to and from end-users, empowering end-users to have richer and more meaningful interactions with AI systems.

6.2 XAI design for Merlin

So what should Merlin’s XAI explanations look like? In this section, we propose a design based on participants’ perceptions of the four XAI approaches (Sec. 5.3 and Tab. 2). We hope this serves as a helpful example of how end-users’ feedback can inform and guide the design of XAI explanations.

Participants preferred specific, part-based explanations that resemble human reasoning and explanations. Participants repeatedly demanded that explanations highlight a few specific features that the AI uses to make its identifications. The XAI approaches that best satisfied this need were concept and prototype-based explanations, which break down the AI’s output into human-digestible units of information, i.e., concepts and prototypes. Participants also appreciated how similar concept and prototype-based explanations were to the way human birders identify birds and explain their identifications to others. Heatmap and example-based explanations were generally less preferred: many participants noted that they were too coarse and impression-based to be useful.

At the same time, participants wanted explanations to be easily understandable. “I don’t want to spend extra time trying to understand what I’m looking at” (P5) was a common sentiment. However, what counts as easily understandable differed between participants. For instance, participants with a low-AI background found concept-based explanations overwhelming due to the numbers and calculations. In contrast, participants with a high-AI background found all explanations intuitive, with some wanting even more numbers and other technical details. These findings underline the importance of our research agenda, as end-users’ XAI needs, perceptions, and the variations therein cannot be anticipated by studying only AI experts, who have made up a primary study population in XAI research.

Based on these observations, if we were to design XAI explanations for Merlin, we would combine prototype-based and concept-based explanations. For simplicity, we would start with a visualization of the user’s input photo overlaid with boxes indicating regions matched to prototypes. We would then give the user an option to tap on each box to get more information. The pop-up would show the prototype matched to the boxed region, accompanied by a short description, i.e., the concept name. As some participants mentioned, prototypes may be ambiguous. For instance, a prototype visualizing part of a bird’s wing may not be informative if end-users don’t know what is special about that part. A text description (e.g., white stripes across the folded wing) would help remove the ambiguity. Further, we would curate the set of prototypes for each bird with end-users and domain experts. We would exclude prototypes that are uninteresting (e.g., feet) or redundant, and limit the total number of prototypes so as not to overwhelm users. Finally, we would optionally provide similarity scores, coefficients, and final class scores for end-users who are interested in diving into the details.
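
As one way to ground this proposal, the following is a sketch of the explanation payload such a design might hand to the app’s UI layer. The field names and the split between the default view and the tap-for-details pop-up are our assumptions, not a specification of Merlin or of any existing XAI library.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical payload for the combined prototype + concept design sketched above.
@dataclass
class PrototypeMatch:
    box: tuple[int, int, int, int]        # matched region in the user's photo (x0, y0, x1, y1)
    prototype_image_url: str              # prototype crop shown when the user taps the box
    concept_name: str                     # e.g., "white stripes across the folded wing"
    similarity: Optional[float] = None    # detail shown only on request
    coefficient: Optional[float] = None   # detail shown only on request

@dataclass
class Explanation:
    predicted_bird: str
    matches: list[PrototypeMatch] = field(default_factory=list)
    final_score: Optional[float] = None   # also optional, for users who want the numbers

    def summary(self) -> str:
        """Default view: boxed regions with concept names, no numbers."""
        return f"{self.predicted_bird}: matched on " + ", ".join(m.concept_name for m in self.matches)
```

Keeping similarity scores, coefficients, and the final class score as optional fields supports the progressive disclosure described above: the default view stays visual and uncluttered, while interested end-users can still dive into the details.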

6.3 Implications for future XAI research

Our findings have broader implications for XAI research beyond designing explanations for Merlin. Below we discuss directions in which XAI explanations should be improved to better serve the needs of AI system end-users.

(1) Explanations should be designed with end-users. Participants often exposed blind spots in existing XAI approaches, revealing a creator-consumer gap in XAI research [33]. For example, they pointed out that the concepts used in concept-based explanations were disconnected from birders’ language (Sec. 5.3.3). The shown concepts (e.g., white body, long wings) were too generic compared to birders’ field mark terms (e.g., wingbar, supercilium). Participants also proposed solutions, e.g., developing the bank of concepts with end-users, and offered to contribute their experience and domain expertise. This example highlights the need for end-users’ participation in the explanation design process and calls for more participatory approaches [86] to XAI research.

(2) Explanations should answer “why,” not just “what.” Several of our participants were unsatisfied with existing XAI approaches that only explain “what” features the AI system was using to produce its output, e.g., heatmap explanations that highlight “what” image regions were important but don’t explain “why” those regions were important (Sec. 5.3.1). They expressed a desire for explanations that answer the “why” question so that they can gain a deeper understanding of the AI’s reasoning process. Explaining causal relationships in computer vision AI models is an open problem; however, it is promising that more researchers are tackling causal questions in XAI research [85].

(3) Explanations should use multiple forms and modalities if warranted. There is no reason for explanations to be limited to one form or modality. Participants often suggested combining two or more XAI approaches to produce more informative explanations (Sec. 5.3.4). They also questioned why Merlin’s identification features (i.e., Photo ID, Sound ID) and our explanation mock-ups were not multimodal, when human birders combine evidence from as many sources as possible (e.g., photo, sound, location) for more accurate bird identification. Expanding the design space of explanations will lead to XAI methods that better satisfy end-users’ needs.

(4) Explanations should be rigorously evaluated. Explanations sometimes have (unintended) negative effects. Recent works have revealed that explanations can engender over-trust in AI or convey misleading understandings of it [3, 33, 51, 57, 69, 74, 90, 114]. Our participants were also concerned about the faithfulness and potential negative effects of explanations (Sec. 5.3.5). To preemptively, rather than reactively, address these issues, it is crucial to rigorously evaluate XAI methods on both their method goals and use case goals throughout the development process.


7 LIMITATIONS AND FUTURE WORK

First, as most of our interview questions and study materials pertain to the Merlin app, our findings may not generalize to other contexts. This is an intentional trade-off made in favor of deepening our understanding of end-users’ XAI needs, uses, and perceptions in a specific context. However, our study design can aid future research on other types of AI applications and the disparate domains and contexts into which they are integrated. Another limitation is that we had relatively few participants in some of the background subgroups; in future research we aim to have more participants from these subgroups. Finally, we did not have access to other stakeholder groups, such as the developers and deployers of the Merlin app. However, different stakeholders in AI systems might have different needs for XAI explanations [45, 64]. We plan to pursue additional research with a more comparative design and study if and how needs differ across stakeholder groups.


8 CONCLUSION

There is a lack of empirical studies examining end-users’ explainability needs and behaviors around XAI explanations in real-world contexts. This gap is important to address in order to make AI systems accessible, usable, and understandable for more people. In this work, we conducted a qualitative, descriptive, and empirical study with 20 end-users of the Merlin bird identification app. Our questions revolved around real-world needs and usage, with a goal of surfacing insights that can be used to improve the design of XAI explanations. We found that participants actively collaborate with the AI system and desire to use explanations to improve that collaboration. We also gathered concrete feedback on four representative XAI approaches that could potentially be embedded into Merlin, finding that participants prefer part-based explanations that resemble human reasoning and explanations. Further, participants’ feedback revealed a creator-consumer gap in XAI, highlighting the need to involve end-users in the XAI design process. Based on our findings, we provided recommendations for future XAI research and design.


ACKNOWLEDGMENTS

We foremost thank our participants for generously sharing their time and experiences. We also thank Tristen Godfrey, Dyanne Ahn, and Klea Tryfoni for their help in the interview transcription. Finally, we thank members of the Princeton HCI Lab and the Princeton Visual AI Lab, especially Amna Liaqat and Fannie Liu, and the anonymous reviewers for their thoughtful and helpful feedback.

This material is based upon work partially supported by the National Science Foundation (NSF) under Grant No. 1763642 awarded to OR. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. We also acknowledge support from the Princeton SEAS Howard B. Wentz, Jr. Junior Faculty Award (OR), Princeton SEAS Project X Fund (RF, OR), Princeton Center for Information Technology Policy (EW), Open Philanthropy (RF, OR), and NSF Graduate Research Fellowship (SK).

Footnotes

1. In this paper, we use the term “XAI explanations” to refer to explanations produced by XAI methods to explain specific AI system outputs.


Supplemental Material

3544548.3581001-video-figure.mp4 (mp4, 13.1 MB)

3544548.3581001-talk-video.mp4 (mp4, 89.8 MB)

3544548.3581001-video-preview.mp4 (mp4, 1.9 MB)

References

  1. Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y. Lim, and Mohan Kankanhalli. 2018. Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–18. https://doi.org/10.1145/3173574.3174156Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Amina Adadi and Mohammed Berrada. 2018. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 6(2018), 52138–52160. https://doi.org/10.1109/ACCESS.2018.2870052Google ScholarGoogle ScholarCross RefCross Ref
  3. Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity Checks for Saliency Maps. In Neural Information Processing Systems (NeurIPS).Google ScholarGoogle Scholar
  4. Kasun Amarasinghe, Kit T. Rodolfa, Sérgio Jesus, Valerie Chen, Vladimir Balayan, Pedro Saleiro, Pedro Bizarro, Ameet Talwalkar, and Rayid Ghani. 2022. On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods. https://doi.org/10.48550/ARXIV.2206.13503Google ScholarGoogle ScholarCross RefCross Ref
  5. Anna Markella Antoniadi, Yuhan Du, Yasmine Guendouz, Lan Wei, Claudia Mazo, Brett A. Becker, and Catherine Mooney. 2021. Current Challenges and Future Opportunities for XAI in Machine Learning-Based Clinical Decision Support Systems: A Systematic Review. Applied Sciences 11, 11 (2021). https://doi.org/10.3390/app11115088Google ScholarGoogle ScholarCross RefCross Ref
  6. Ines Arous, Jie Yang, Mourad Khayati, and Philippe Cudré-Mauroux. 2020. OpenCrowd: A Human-AI Collaborative Approach for Finding Social Influencers via Open-Ended Answers Aggregation. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA, 1851–1862. https://doi.org/10.1145/3366423.3380254Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. 2020. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. Information Fusion (2020).Google ScholarGoogle Scholar
  8. Zahra Ashktorab, Q. Vera Liao, Casey Dugan, James Johnson, Qian Pan, Wei Zhang, Sadhana Kumaravel, and Murray Campbell. 2020. Human-AI Collaboration in a Cooperative Game Setting: Measuring Social Perception and Outcomes. Proc. ACM Hum.-Comput. Interact. 4, CSCW2, Article 96 (oct 2020), 20 pages. https://doi.org/10.1145/3415167Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Shahin Atakishiyev, Mohammad Salameh, Hengshuai Yao, and Randy Goebel. 2021. Explainable Artificial Intelligence for Autonomous Driving: A Comprehensive Overview and Field Guide for Future Research Directions. CoRR abs/2112.11561(2021). arXiv:2112.11561https://arxiv.org/abs/2112.11561Google ScholarGoogle Scholar
  10. Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S. Lasecki, Daniel S. Weld, and Eric Horvitz. 2019. Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. In AAAI Conference on Human Computation and Crowdsourcing (HCOMP).Google ScholarGoogle Scholar
  11. Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. 2021. Does the Whole Exceed Its Parts? The Effect of AI Explanations on Complementary Team Performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 81, 16 pages. https://doi.org/10.1145/3411764.3445717Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network Dissection: Quantifying Interpretability of Deep Visual Representations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarGoogle Scholar
  13. David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. 2019. Seeing What a GAN Cannot Generate. In International Conference on Computer Vision (ICCV).Google ScholarGoogle ScholarCross RefCross Ref
  14. Kevin Baum, Susanne Mantel, Eva Schmidt, and Timo Speith. 2022. From Responsibility to Reason-Giving Explainable Artificial Intelligence. Philosophy & Technology 35, 1 (2022), 12. https://doi.org/10.1007/s13347-022-00510-wGoogle ScholarGoogle ScholarCross RefCross Ref
  15. Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José M. F. Moura, and Peter Eckersley. 2020. Explainable Machine Learning in Deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* ’20). Association for Computing Machinery, New York, NY, USA, 648–657. https://doi.org/10.1145/3351095.3375624Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Jeffrey P Bigham, Chandrika Jayant, Andrew Miller, Brandyn White, and Tom Yeh. 2010. VizWiz:: LocateIt-enabling blind people to locate objects in their environment. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. IEEE, 65–72.Google ScholarGoogle ScholarCross RefCross Ref
  17. Wieland Brendel and Matthias Bethge. 2019. Approximating CNNs with Bag-of-local-Features Models Works Surprisingly Well on ImageNet. In International Conference on Learning Representations (ICLR).Google ScholarGoogle Scholar
  18. David Broniatowski. 2021. Psychological Foundations of Explainability and Interpretability in Artificial Intelligence. https://doi.org/10.6028/NIST.IR.8367Google ScholarGoogle ScholarCross RefCross Ref
  19. Miles Brundage, Shahar Avin, Jasmine Wang, Haydn Belfield, Gretchen Krueger, Gillian Hadfield, Heidy Khlaaf, Jingying Yang, Helen Toner, Ruth Fong, Tegan Maharaj, Pang Wei Koh, Sara Hooker, Jade Leung, Andrew Trask, Emma Bluemke, Jonathan Lebensold, Cullen O’Keefe, Mark Koren, Théo Ryffel, JB Rubinovitz, Tamay Besiroglu, Federica Carugati, Jack Clark, Peter Eckersley, Sarah de Haas, Maritza Johnson, Ben Laurie, Alex Ingerman, Igor Krawczuk, Amanda Askell, Rosario Cammarota, Andrew Lohn, David Krueger, Charlotte Stix, Peter Henderson, Logan Graham, Carina Prunkl, Bianca Martin, Elizabeth Seger, Noa Zilberman, Seán Ó hÉigeartaigh, Frens Kroeger, Girish Sastry, Rebecca Kagan, Adrian Weller, Brian Tse, Elizabeth Barnes, Allan Dafoe, Paul Scharre, Ariel Herbert-Voss, Martijn Rasser, Shagun Sodhani, Carrick Flynn, Thomas Krendl Gilbert, Lisa Dyer, Saif Khan, Yoshua Bengio, and Markus Anderljung. 2020. Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. https://doi.org/10.48550/ARXIV.2004.07213Google ScholarGoogle ScholarCross RefCross Ref
  20. Moritz Böhle, Mario Fritz, and Bernt Schiele. 2021. Convolutional Dynamic Alignment Networks for Interpretable Classifications. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarGoogle Scholar
  21. Moritz Böhle, Mario Fritz, and Bernt Schiele. 2022. B-cos Networks: Alignment is All We Need for Interpretability. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarGoogle ScholarCross RefCross Ref
  22. Carrie J. Cai, Emily Reif, Narayan Hegde, Jason Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Greg S. Corrado, Martin C. Stumpe, and Michael Terry. 2019. Human-Centered Tools for Coping with Imperfect Algorithms During Medical Decision-Making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–14. https://doi.org/10.1145/3290605.3300234Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Carrie J. Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. 2019. "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 104 (nov 2019), 24 pages. https://doi.org/10.1145/3359206Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K Su. 2019. This Looks Like That: Deep Learning for Interpretable Image Recognition. In Neural Information Processing Systems (NeurIPS).Google ScholarGoogle Scholar
  25. Nazli Cila. 2022. Designing Human-Agent Collaborations: Commitment, Responsiveness, and Support. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 420, 18 pages. https://doi.org/10.1145/3491102.3517500Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Roberto Confalonieri, Tarek R. Besold, Tillman Weyde, Kathleen Creel, Tania Lombrozo, Shane T. Mueller, and Patrick Shafto. 2019. What makes a good explanation? Cognitive dimensions of explaining intelligent machines. In CogSci. 25–26. https://mindmodeling.org/cogsci2019/papers/0013/index.htmlGoogle ScholarGoogle Scholar
  27. Arun Das and Paul Rad. 2020. Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A Survey. CoRR abs/2006.11371(2020). arXiv:2006.11371https://arxiv.org/abs/2006.11371Google ScholarGoogle Scholar
  28. Jon Donnelly, Alina Jade Barnett, and Chaofan Chen. 2022. Deformable ProtoPNet: An Interpretable Image Classifier Using Deformable Prototypes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarGoogle Scholar
  29. Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. arxiv:1702.08608 [stat.ML]Google ScholarGoogle Scholar
  30. Abhimanyu Dubey, Filip Radenovic, and Dhruv Mahajan. 2022. Scalable Interpretability via Polynomials. In Neural Information Processing Systems (NeurIPS).Google ScholarGoogle Scholar
  31. Salvatore D’Avella, Gerardo Camacho-Gonzalez, and Paolo Tripicchio. 2022. On Multi-Agent Cognitive Cooperation: Can virtual agents behave like humans?Neurocomputing 480(2022), 27–38. https://doi.org/10.1016/j.neucom.2022.01.025Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Upol Ehsan, Q. Vera Liao, Michael Muller, Mark O. Riedl, and Justin D. Weisz. 2021. Expanding Explainability: Towards Social Transparency in AI Systems. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 82, 19 pages. https://doi.org/10.1145/3411764.3445188Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Upol Ehsan, Samir Passi, Q. Vera Liao, Larry Chan, I-Hsiang Lee, Michael J. Muller, and Mark O. Riedl. 2021. The Who in Explainable AI: How AI Background Shapes Perceptions of AI Explanations. CoRR abs/2107.13509(2021). arXiv:2107.13509https://arxiv.org/abs/2107.13509Google ScholarGoogle Scholar
  34. Upol Ehsan and Mark O. Riedl. 2020. Human-centered Explainable AI: Towards a Reflective Sociotechnical Approach. CoRR abs/2002.01092(2020). arXiv:2002.01092https://arxiv.org/abs/2002.01092Google ScholarGoogle Scholar
  35. Upol Ehsan, Philipp Wintersberger, Q. Vera Liao, Martina Mara, Marc Streit, Sandra Wachter, Andreas Riener, and Mark O. Riedl. 2021. Operationalizing Human-Centered Perspectives in Explainable AI. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI EA ’21). Association for Computing Machinery, New York, NY, USA, Article 94, 6 pages. https://doi.org/10.1145/3411763.3441342Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Upol Ehsan, Philipp Wintersberger, Q. Vera Liao, Elizabeth Anne Watkins, Carina Manger, Hal Daumé III, Andreas Riener, and Mark O Riedl. 2022. Human-Centered Explainable AI (HCXAI): Beyond Opening the Black-Box of AI. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, Article 109, 7 pages. https://doi.org/10.1145/3491101.3503727Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Madeleine Clare Elish and Elizabeth Anne Watkins. 2020. Repairing innovation: A study of integrating AI in clinical care. Data & Society (2020).Google ScholarGoogle Scholar
  38. Thomas Fel, Julien Colin, Rémi Cadène, and Thomas Serre. 2021. What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods. arxiv:2112.04417 [s.CV]Google ScholarGoogle Scholar
  39. Andrea Ferrario and Michele Loi. 2022. How Explainability Contributes to Trust in AI. In 2022 ACM Conference on Fairness, Accountability, and Transparency (Seoul, Republic of Korea) (FAccT ’22). Association for Computing Machinery, New York, NY, USA, 1457–1466. https://doi.org/10.1145/3531146.3533202Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Riccardo Fogliato, Shreya Chappidi, Matthew Lungren, Paul Fisher, Diane Wilson, Michael Fitzke, Mark Parkinson, Eric Horvitz, Kori Inkpen, and Besmira Nushi. 2022. Who Goes First? Influences of Human-AI Workflow on Decision Making in Clinical Imaging. In 2022 ACM Conference on Fairness, Accountability, and Transparency (Seoul, Republic of Korea) (FAccT ’22). Association for Computing Machinery, New York, NY, USA, 1362–1374. https://doi.org/10.1145/3531146.3533193Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Ruth Fong. 2020. Understanding convolutional neural networks. Ph. D. Dissertation. University of Oxford.Google ScholarGoogle Scholar
  42. Ruth Fong, Mandela Patrick, and Andrea Vedaldi. 2019. Understanding Deep Networks via Extremal Perturbations and Smooth Masks. In International Conference on Computer Vision (ICCV).Google ScholarGoogle ScholarCross RefCross Ref
  43. Ruth Fong and Andrea Vedaldi. 2017. Interpretable Explanations of Black Boxes by Meaningful Perturbation. In International Conference on Computer Vision (ICCV).Google ScholarGoogle ScholarCross RefCross Ref
  44. Ruth Fong and Andrea Vedaldi. 2018. Net2Vec: Quantifying and Explaining how Concepts are Encoded by Filters in Deep Neural Networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarGoogle ScholarCross RefCross Ref
  45. Julie Gerlings, Millie Søndergaard Jensen, and Arisa Shollo. 2021. Explainable AI, but explainable to whom?CoRR abs/2106.05568(2021). arXiv:2106.05568https://arxiv.org/abs/2106.05568Google ScholarGoogle Scholar
  46. Leilani H. Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining explanations: An overview of interpretability of machine learning. In The 5th IEEE International Conference on Data Science and Advanced Analytics (DSAA).Google ScholarGoogle ScholarCross RefCross Ref
  47. Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Counterfactual Visual Explanations. In International Conference on Machine Learning (ICML).Google ScholarGoogle Scholar
  48. Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A Survey of Methods for Explaining Black Box Models. ACM Comput. Surv. 51, 5, Article 93 (aug 2018), 42 pages. https://doi.org/10.1145/3236009Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. David Gunning and David Aha. 2019. DARPA’s Explainable Artificial Intelligence (XAI) Program. AI Magazine (2019).Google ScholarGoogle Scholar
  50. Robert R. Hoffman, Gary Klein, and Shane T. Mueller. 2018. Explaining Explanation For “Explainable AI”. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 62, 1(2018), 197–201. https://doi.org/10.1177/1541931218621047 arXiv:https://doi.org/10.1177/1541931218621047Google ScholarGoogle ScholarCross RefCross Ref
  51. Adrian Hoffmann, Claudio Fanconi, Rahul Rade, and Jonas Kohler. 2021. This Looks Like That... Does it? Shortcomings of Latent Space Prototype Interpretability in Deep Networks. In International Conference on Machine Learning (ICML) Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI.Google ScholarGoogle Scholar
  52. Sungsoo Ray Hong, Jessica Hullman, and Enrico Bertini. 2020. Human Factors in Model Interpretability: Industry Practices, Challenges, and Needs. Proc. ACM Hum.-Comput. Interact. 4, CSCW1, Article 68 (may 2020), 26 pages. https://doi.org/10.1145/3392878Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Shagun Jhaver, Iris Birman, Eric Gilbert, and Amy Bruckman. 2019. Human-Machine Collaboration for Content Regulation: The Case of Reddit Automoderator. ACM Trans. Comput.-Hum. Interact. 26, 5, Article 31 (jul 2019), 35 pages. https://doi.org/10.1145/3338243Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Atoosa Kasirzadeh. 2021. Reasons, Values, Stakeholders: A Philosophical Framework for Explainable Artificial Intelligence. CoRR abs/2103.00752(2021). arXiv:2103.00752https://arxiv.org/abs/2103.00752Google ScholarGoogle Scholar
  55. Been Kim, Emily Reif, Martin Wattenberg, Samy Bengio, and Michael C. Mozer. 2021. Neural Networks Trained on Natural Scenes Exhibit Gestalt Closure. Computational Brain & Behavior(2021).Google ScholarGoogle Scholar
  56. Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, 2018. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In ICML.Google ScholarGoogle Scholar
  57. Sunnie S. Y. Kim, Nicole Meister, Vikram V. Ramaswamy, Ruth Fong, and Olga Russakovsky. 2022. HIVE: Evaluating the Human Interpretability of Visual Explanations. In European Conference on Computer Vision (ECCV).Google ScholarGoogle Scholar
  58. Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In International Conference on Machine Learning (ICML).Google ScholarGoogle Scholar
  59. Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. 2020. Concept Bottleneck Models. In International Conference on Machine Learning (ICML).Google ScholarGoogle Scholar
  60. Ranjay Krishna, Donsuk Lee, Li Fei-Fei, and Michael S. Bernstein. 2022. Socially situated artificial intelligence enables learning from human interaction. Proceedings of the National Academy of Sciences 119, 39(2022), e2115730119. https://doi.org/10.1073/pnas.2115730119 arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.2115730119Google ScholarGoogle ScholarCross RefCross Ref
  61. Shitij Kumar, Celal Savur, and Ferat Sahin. 2021. Survey of Human–Robot Collaboration in Industrial Settings: Awareness, Intelligence, and Compliance. IEEE Transactions on Systems, Man, and Cybernetics: Systems 51, 1(2021), 280–297. https://doi.org/10.1109/TSMC.2020.3041231Google ScholarGoogle ScholarCross RefCross Ref
  62. Vivian Lai, Samuel Carton, Rajat Bhatnagar, Q. Vera Liao, Yunfeng Zhang, and Chenhao Tan. 2022. Human-AI Collaboration via Conditional Delegation: A Case Study of Content Moderation. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 54, 18 pages. https://doi.org/10.1145/3491102.3501999Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Vivian Lai and Chenhao Tan. 2019. On Human Predictions with Explanations and Predictions of Machine Learning Models: A Case Study on Deception Detection. In Proceedings of the Conference on Fairness, Accountability, and Transparency (Atlanta, GA, USA) (FAT* ’19). Association for Computing Machinery, New York, NY, USA, 29–38. https://doi.org/10.1145/3287560.3287590Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Markus Langer, Daniel Oster, Timo Speith, Holger Hermanns, Lena Kästner, Eva Schmidt, Andreas Sesing, and Kevin Baum. 2021. What Do We Want From Explainable Artificial Intelligence (XAI)? - A Stakeholder Perspective on XAI and a Conceptual Model Guiding Interdisciplinary XAI Research. CoRR abs/2102.07817(2021). arXiv:2102.07817https://arxiv.org/abs/2102.07817Google ScholarGoogle Scholar
  65. Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 388, 19 pages. https://doi.org/10.1145/3491102.3502030Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Q. Vera Liao, Daniel Gruen, and Sarah Miller. 2020. Questioning the AI: Informing Design Practices for Explainable AI User Experiences. Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3313831.3376590Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Q. Vera Liao and Kush R. Varshney. 2021. Human-Centered Explainable AI (XAI): From Algorithms to User Experiences. CoRR abs/2110.10790(2021). arXiv:2110.10790https://arxiv.org/abs/2110.10790Google ScholarGoogle Scholar
  68. Q. Vera Liao, Yunfeng Zhang, Ronny Luss, Finale Doshi-Velez, and Amit Dhurandhar. 2022. Connecting Algorithmic Research and Usage Contexts: A Perspective of Contextualized Evaluation for Explainable AI. https://doi.org/10.48550/ARXIV.2206.10847Google ScholarGoogle ScholarCross RefCross Ref
  69. Zachary C. Lipton. 2018. The Mythos of Model Interpretability: In Machine Learning, the Concept of Interpretability is Both Important and Slippery.Queue (2018).Google ScholarGoogle Scholar
  70. Ryan Louie, Andy Coenen, Cheng Zhi Huang, Michael Terry, and Carrie J. Cai. 2020. Novice-AI Music Co-Creation via AI-Steering Tools for Deep Generative Models. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376739Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. Jörn Lötsch, Dario Kringel, and Alfred Ultsch. 2022. Explainable Artificial Intelligence (XAI) in Biomedicine: Making AI Decisions Trustworthy for Physicians and Patients. BioMedInformatics 2, 1 (2022), 1–17. https://doi.org/10.3390/biomedinformatics2010001Google ScholarGoogle ScholarCross RefCross Ref
  72. R. Machlev, L. Heistrene, M. Perl, K.Y. Levy, J. Belikov, S. Mannor, and Y. Levron. 2022. Explainable Artificial Intelligence (XAI) techniques for energy and power systems: Review, challenges and opportunities. Energy and AI 9(2022), 100169. https://doi.org/10.1016/j.egyai.2022.100169Google ScholarGoogle ScholarCross RefCross Ref
  73. Antonios Mamalakis, Imme Ebert-Uphoff, and Elizabeth A. Barnes. 2022. Explainable Artificial Intelligence in Meteorology and Climate Science: Model Fine-Tuning, Calibrating Trust and Learning New Science. Springer International Publishing, Cham, 315–339. https://doi.org/10.1007/978-3-031-04083-2_16Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik, and Adrian Weller. 2021. Do Concept Bottleneck Models Learn as Intended?. In International Conference on Learning Representations (ICLR) Workshop on Responsible AI.Google ScholarGoogle Scholar
  75. Aniek F Markus, Jan A Kors, and Peter R Rijnbeek. 2021. The role of explainability in creating trustworthy artificial intelligence for health care: A comprehensive survey of the terminology, design choices, and evaluation strategies. Journal of biomedical informatics 113 (2021). https://doi.org/doi:10.1016/j.jbi.2020.103655Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Nora McDonald, Sarita Schoenebeck, and Andrea Forte. 2019. Reliability and inter-rater reliability in qualitative research: Norms and guidelines for CSCW and HCI practice. Proceedings of the ACM on human-computer interaction 3, CSCW(2019), 1–23.Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Quentin Meteier, Marine Capallera, Leonardo Angelini, Elena Mugellini, Omar Abou Khaled, Stefano Carrino, Emmanuel De Salis, Stéphane Galland, and Susanne Boll. 2019. Workshop on Explainable AI in Automated Driving: A User-Centered Interaction Approach. In Proceedings of the 11th International Conference on Automotive User Interfaces and Interactive Vehicular Applications: Adjunct Proceedings (Utrecht, Netherlands) (AutomotiveUI ’19). Association for Computing Machinery, New York, NY, USA, 32–37. https://doi.org/10.1145/3349263.3350762Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Tim Miller. 2017. Explanation in Artificial Intelligence: Insights from the Social Sciences. CoRR abs/1706.07269(2017). arXiv:1706.07269http://arxiv.org/abs/1706.07269Google ScholarGoogle Scholar
  79. Tim Miller. 2022. Are we measuring trust correctly in explainability, interpretability, and transparency research?https://doi.org/10.48550/ARXIV.2209.00651Google ScholarGoogle ScholarCross RefCross Ref
  80. Tim Miller, Piers Howe, and Liz Sonenberg. 2017. Explainable AI: Beware of Inmates Running the Asylum Or: How I Learnt to Stop Worrying and Love the Social and Behavioural Sciences. CoRR abs/1712.00547(2017). arXiv:1712.00547http://arxiv.org/abs/1712.00547Google ScholarGoogle Scholar
  81. Mitch Waite Group. [n. d.]. iBird Pro Guide. https://apps.apple.com/us/app/ibird-pro-guide-to-birds/id308018823Google ScholarGoogle Scholar
  82. Brent Mittelstadt, Chris Russell, and Sandra Wachter. 2019. Explaining Explanations in AI. In Proceedings of the Conference on Fairness, Accountability, and Transparency (Atlanta, GA, USA) (FAT* ’19). Association for Computing Machinery, New York, NY, USA, 279–288. https://doi.org/10.1145/3287560.3287574Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Sina Mohseni, Niloofar Zarei, and Eric D. Ragan. 2018. A Survey of Evaluation Methods and Measures for Interpretable Machine Learning. CoRR abs/1811.11839(2018). arXiv:1811.11839http://arxiv.org/abs/1811.11839Google ScholarGoogle Scholar
  84. Christoph Molnar. 2022. Interpretable Machine Learning(2 ed.). https://christophm.github.io/interpretable-ml-bookGoogle ScholarGoogle Scholar
  85. Raha Moraffah, Mansooreh Karami, Ruocheng Guo, Adrienne Raglin, and Huan Liu. 2020. Causal Interpretability for Machine Learning - Problems, Methods and Evaluation. SIGKDD Explor. Newsl. 22, 1 (may 2020), 18–33. https://doi.org/10.1145/3400051.3400058Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. Michael J. Muller. 2002. Participatory Design: The Third Space in HCI. L. Erlbaum Associates Inc., USA, 1051–1068.Google ScholarGoogle Scholar
  87. National Audubon Society. [n. d.]. Audubon Bird Guide. https://apps.apple.com/us/app/audubon-bird-guide/id333227386Google ScholarGoogle Scholar
  88. Meike Nauta, Ron van Bree, and Christin Seifert. 2021. Neural Prototype Trees for Interpretable Fine-Grained Image Recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarGoogle Scholar
  89. An T. Nguyen, Aditya Kharosekar, Saumyaa Krishnan, Siddhesh Krishnan, Elizabeth Tate, Byron C. Wallace, and Matthew Lease. 2018. Believe It or Not: Designing a Human-AI Partnership for Mixed-Initiative Fact-Checking. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology(Berlin, Germany) (UIST ’18). Association for Computing Machinery, New York, NY, USA, 189–199. https://doi.org/10.1145/3242587.3242666Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. Giang Nguyen, Daeyoung Kim, and Anh Nguyen. 2021. The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. In Neural Information Processing Systems (NeurIPS).Google ScholarGoogle Scholar
  91. Giang Nguyen, Mohammad Reza Taesiri, and Anh Nguyen. 2022. Visual correspondence-based explanations improve AI robustness and human-AI team accuracy. In Neural Information Processing Systems (NeurIPS).Google ScholarGoogle Scholar
  92. Takashi Numata, Hiroki Sato, Yasuhiro Asa, Takahiko Koike, Kohei Miyata, Eri Nakagawa, Motofumi Sumiya, and Norihiro Sadato. 2020. Achieving affective human–virtual agent communication by enabling virtual agents to imitate positive expressions. Scientific Reports 10, 1 (2020), 5977. https://doi.org/10.1038/s41598-020-62870-7Google ScholarGoogle ScholarCross RefCross Ref
  93. Uchenna Emeoha Ogenyi, Jinguo Liu, Chenguang Yang, Zhaojie Ju, and Honghai Liu. 2021. Physical Human–Robot Collaboration: Robotic Systems, Learning Methods, Collaborative Strategies, Sensors, and Actuators. IEEE Transactions on Cybernetics 51, 4 (2021), 1888–1901. https://doi.org/10.1109/TCYB.2019.2947532Google ScholarGoogle ScholarCross RefCross Ref
  94. Daniel Omeiza, Helena Webb, Marina Jirotka, and Lars Kunze. 2022. Explanations in Autonomous Driving: A Survey. IEEE Transactions on Intelligent Transportation Systems 23, 8(2022), 10142–10162. https://doi.org/10.1109/TITS.2021.3122865Google ScholarGoogle ScholarCross RefCross Ref
  95. Thomas O’Neill, Nathan McNeese, Amy Barron, and Beau Schelble. 2022. Human–Autonomy Teaming: A Review and Analysis of the Empirical Literature. Human Factors 64, 5 (2022), 904–938. https://doi.org/10.1177/0018720820960865 arXiv:https://doi.org/10.1177/0018720820960865PMID: 33092417.Google ScholarGoogle ScholarCross RefCross Ref
  96. Michael Pazzani, Severine Soltani, Robert Kaufman, Samson Qian, and Albert Hsiao. 2022. Expert-Informed, User-Centric Explanations for Machine Learning. Proceedings of the AAAI Conference on Artificial Intelligence 36, 11 (Jun. 2022), 12280–12286. https://doi.org/10.1609/aaai.v36i11.21491Google ScholarGoogle ScholarCross RefCross Ref
  97. Dino Pedreschi, Fosca Giannotti, Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, and Franco Turini. 2019. Meaningful Explanations of Black Box AI Decision Systems. Proceedings of the AAAI Conference on Artificial Intelligence 33, 01 (Jul. 2019), 9780–9784. https://doi.org/10.1609/aaai.v33i01.33019780Google ScholarGoogle ScholarDigital LibraryDigital Library
  98. Vitali Petsiuk, Abir Das, and Kate Saenko. 2018. RISE: Randomized Input Sampling for Explanation of Black-box Models. In British Machine Vision Conference (BMVC).Google ScholarGoogle Scholar
  99. Barbara Pintar. 2017. Birdadvisor 360°: A digital support for birdwatching tourism in Algarve. Ph. D. Dissertation. University of Algarve.Google ScholarGoogle Scholar
  100. Milda Pocevičiūtė, Gabriel Eilertsen, and Claes Lundström. 2020. Survey of XAI in Digital Pathology. Springer International Publishing, Cham, 56–88. https://doi.org/10.1007/978-3-030-50402-1_4Google ScholarGoogle ScholarCross RefCross Ref
  101. Forough Poursabzi-Sangdeh, Daniel G Goldstein, Jake M Hofman, Jennifer Wortman Wortman Vaughan, and Hanna Wallach. 2021. Manipulating and Measuring Model Interpretability. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 237, 52 pages. https://doi.org/10.1145/3411764.3445315Google ScholarGoogle ScholarDigital LibraryDigital Library
  102. Alun D. Preece, Dan Harborne, Dave Braines, Richard Tomsett, and Supriyo Chakraborty. 2018. Stakeholders in Explainable AI. CoRR abs/1810.00184(2018). arXiv:1810.00184http://arxiv.org/abs/1810.00184Google ScholarGoogle Scholar
  103. Filip Radenovic, Abhimanyu Dubey, and Dhruv Mahajan. 2022. Neural Basis Models for Interpretability. In Neural Information Processing Systems (NeurIPS).Google ScholarGoogle Scholar
  104. Vikram V. Ramaswamy, Sunnie S. Y. Kim, Ruth Fong, and Olga Russakovsky. 2022. Overlooked factors in concept-based explanations: Dataset choice, concept salience, and human capability. https://doi.org/10.48550/ARXIV.2207.09615Google ScholarGoogle ScholarCross RefCross Ref
  105. Vikram V. Ramaswamy, Sunnie S. Y. Kim, Nicole Meister, Ruth Fong, and Olga Russakovsky. 2022. ELUDE: Generating interpretable explanations via a decomposition into labelled and unlabelled features. https://doi.org/10.48550/ARXIV.2206.07690Google ScholarGoogle ScholarCross RefCross Ref
  106. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?": Explaining the Predictions of Any Classifier. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).Google ScholarGoogle ScholarDigital LibraryDigital Library
  107. Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. 2022. Interpretable machine learning: Fundamental principles and 10 grand challenges. Statistics Surveys 16, none (2022), 1 – 85. https://doi.org/10.1214/21-SS133Google ScholarGoogle ScholarCross RefCross Ref
  108. Johnny Saldaña. 2021. The coding manual for qualitative researchers. The coding manual for qualitative researchers (2021), 1–440.Google ScholarGoogle Scholar
  109. Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller (Eds.). 2019. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Lecture Notes in Computer Science, Vol. 11700. Springer.Google ScholarGoogle ScholarDigital LibraryDigital Library
  110. Lindsay Sanneman and Julie A. Shah. 2020. A Situation Awareness-Based Framework for Design and Evaluation of Explainable AI. In Explainable, Transparent Autonomous Agents and Multi-Agent Systems, Davide Calvaresi, Amro Najjar, Michael Winikoff, and Kary Främling (Eds.). Springer International Publishing, Cham, 94–110.Google ScholarGoogle Scholar
  111. Nicolas Scharowski, Sebastian A. C. Perrig, Nick von Felten, and Florian Brühlmann. 2022. Trust and Reliance in XAI – Distinguishing Between Attitudinal and Behavioral Measures. https://doi.org/10.48550/ARXIV.2203.12318Google ScholarGoogle ScholarCross RefCross Ref
  112. Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In International Conference on Computer Vision (ICCV).Google ScholarGoogle ScholarCross RefCross Ref
  113. Mark Sendak, Madeleine Clare Elish, Michael Gao, Joseph Futoma, William Ratliff, Marshall Nichols, Armando Bedoya, Suresh Balu, and Cara O’Brien. 2020. "The Human Body is a Black Box": Supporting Clinical Decision-Making with Deep Learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* ’20). Association for Computing Machinery, New York, NY, USA, 99–109. https://doi.org/10.1145/3351095.3372827Google ScholarGoogle ScholarDigital LibraryDigital Library
  114. Hua Shen and Ting-Hao Kenneth Huang. 2020. How Useful Are the Machine-Generated Interpretations to General Users? A Human Evaluation on Guessing the Incorrectly Predicted Labels. In AAAI Conference on Human Computation and Crowdsourcing (HCOMP).Google ScholarGoogle ScholarCross RefCross Ref
  115. Vivswan Shitole, Fuxin Li, Minsuk Kahng, Prasad Tadepalli, and Alan Fern. 2021. One Explanation is Not Enough: Structured Attention Graphs for Image Classification. In Neural Information Processing Systems (NeurIPS).Google ScholarGoogle Scholar
  116. Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In International Conference on Learning Representations (ICLR) Workshops.Google ScholarGoogle Scholar
  117. Amitojdeep Singh, Sourya Sengupta, and Vasudevan Lakshminarayanan. 2020. Explainable Deep Learning Models in Medical Image Analysis. Journal of Imaging 6(2020). https://doi.org/10.3390/jimaging6060052Google ScholarGoogle ScholarCross RefCross Ref
  118. Dylan Slack, Satyapriya Krishna, Himabindu Lakkaraju, and Sameer Singh. 2022. TalkToModel: Explaining Machine Learning Models with Interactive Natural Language Conversations. arXiv (2022).Google ScholarGoogle Scholar
  119. Helen Smith. 2021. Clinical AI: opacity, accountability, responsibility and liability. AI & SOCIETY 36, 2 (2021), 535–545. https://doi.org/10.1007/s00146-020-01019-6Google ScholarGoogle ScholarDigital LibraryDigital Library
  120. Alison Smith-Renner, Ron Fan, Melissa Birchfield, Tongshuang Wu, Jordan Boyd-Graber, Daniel S. Weld, and Leah Findlater. 2020. No Explainability without Accountability: An Empirical Study of Explanations and Feedback in Interactive ML. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376624Google ScholarGoogle ScholarDigital LibraryDigital Library
  121. Nathalie A. Smuha. 2019. The EU Approach to Ethics Guidelines for Trustworthy Artificial Intelligence. Computer Law Review International 20, 4 (2019), 97–106. https://doi.org/doi:10.9785/cri-2019-200402Google ScholarGoogle ScholarCross RefCross Ref