1 Introduction

Writers have long followed in the footsteps of their predecessors. Clark Ashton Smith was openly influenced by the style of Edgar Allan Poe (Carter 1976) [p. 110], and Lovecraft made no secret that he was inspired by the stories of both authors while crafting his own universe. Reusing plot devices or story-stratagems is common practice in genres as diverse as Sword & Sorcery or poetry, as the three-time Pulitzer winner MacLeish once put it: “a real writer learns from earlier writers the way a boy learns from an apple orchard–by stealing what he has a taste for, and can carry off” (Carter 1973) [p. 158]. In the case of fanfiction, authors reuse the official material (i.e., the ‘canon’, including original characters and settings) of one or more authors, while allowing significant departures in style (e.g., a horror story may become a romance). Fanfictions are unlike stories written by professional writers with the agreement of rights-holders and a focus on sticking to the canon by developing the viewpoints of minor characters (e.g., the monster squatting in the trash compactor in Star Wars (Okorafor 2017)). Fanfictions are a popular literature and transformative work in which possibly amateur writers reuse a universe without seeking approval from rights-holders. Motivations include closing gaps in the original story (Koltochikhina and Tsepkova 2020) or recasting content, as exemplified by the ‘world-queering’ practice of transforming heteronormative relationships with queer characters (Floegel 2020; Llewellyn 2022). Scholars have used fanfiction to advance writing skills (Sauro and Sundmark 2019; Leigh 2020) owing to its blurred position between creative writing and literary criticism (Petersen-Reed 2019).

Fanfictions are not a new phenomenon. Goethe’s 18th-century novel The Sorrows of Young Werther was followed by hundreds of stories that reused its characters, known as Wertheriaden (Birkhold 2019). Fanfictions related to television shows are also well established, with seminal studies such as Jenkins’ work in the early 1990s (Jenkins 1992). However, the internet has enabled fanfictions at a new scale, resulting in a ‘web literature’ (Koltochikhina and Tsepkova 2020). The internet enables the production of fanfiction, as authors do not need to fear censorship, legal implications, or other professional consequences (Walls-Thumma 2019). The medium also facilitates the consumption of fanfiction, as stories are usually free to read, unlike officially sanctioned works that are typically bought as physical copies (Datlow 2017). Most importantly for this paper, the growing availability of online fanfiction together with the rising sophistication of Natural Language Processing (NLP) techniques has powered new interdisciplinary lines of research in social network analysis and text mining.

Previous analyses of fanfictions with NLP have served to answer a variety of questions. In this paper, we focus on research that is fully automated and performed at a large scale. For example, the recent analysis by McCloskey et al. is out of scope since its corpus was on the scale of hundreds and because NLP methods were used alongside human inspection (McCloskey et al. 2022). Several large-scale, fully automated studies from social computing researchers have investigated fanfiction platforms with respect to community engagement, for example in terms of guiding writers (e.g., mentorship, critical feedback) or fostering solidarity via creative collaborations (e.g., for marginalized groups). For example, researchers found that the reviews provided by peers (known as ‘distributed mentoring’) had a statistically significant impact on a writer’s vocabulary, as measured by lexical diversity (Frens et al. 2018). Reviews have also been examined for sentiment analysis by using a trained classification model (also known as a ‘classifier’) to associate emotional responses with text. Certain characters within fanfiction stories can elicit different responses based on their actions and portrayal. Researchers found that the words surrounding mentions of characters had a statistically significant impact on readers’ emotional responses (Milli and Bamman 2016).

Classifiers have also been trained on large numbers of stories to make predictions, for example by inferring the next spell cast in Harry Potter fanfictions (Vilares et al. 2019). Using data up to March 2016 from the Hugo Award-winning fanfiction hosting site Archive of Our Own (AO3), Jing et al. performed the related task of regression to predict the popularity of a story as a function of the novelty of its word patterns, while controlling for metadata (e.g., age rating, warnings for sensitive elements) (Jing et al. 2019). A more recent study also examined the popularity of stories by collecting a few thousand samples from five domains (e.g., Harry Potter, Twilight), extracting characters and sentiments, and also performing a regression on popularity. This study exemplifies the challenge of predicting popular stories, as the authors concluded that they “found none of the mentioned variables relevant to the popularity of fanfictions”, with adjusted regression scores ranging from 0.14 to 0.28 (Sourati Hassan Zadeh 2022). Our study revisits the challenge of predicting popular stories by focusing on one universe to extract a large corpus and leverage modern methods that go beyond the metadata examined in prior works.

We investigate fanfictions about the TV show Supernatural, which has the largest volume of stories related to a TV show, amounting to over 253,000 stories on AO3 and more than 126,000 stories on Fanfiction.net, as of February 2023. The longevity of the show (15 seasons) is attributed in part to its passionate fanbase (calling itself the ‘SPNFamily’), which was noted as having the largest amount of engagement on other social media platforms (Myrick 2019). The show creator has been repeatedly ‘flattered’ by fanfiction (Damore 2019) and supportive of the “inclusive community that’s formed around watching and interacting with the show” (Frith 2015). Although scholars have argued that the relation between the series' creators and the audiences was not always productive (Guirola 2023), this relationship still resulted in incorporating some of the fandom into the show (Zubernis 2021). This contrasts with the tumultuous relations between fanfiction authors and writers in other shows (Michaud Wild 2020), or even the framing of fanfiction as a subversive act (Wang 2019). Due to the massive number of stories, the engagement across platforms, and the interplay with the show creators, many scholarly works have been devoted to both the Supernatural show (Gonçalves 2015) and its fanfictions (Åström 2010; Flegel and Roth 2010; Tosenberger 2008; Herbig and Herrmann 2016), as well as edited volumes (Taylor and Nylander 2019; Macklem et al. 2020). Focusing on Supernatural thus allows us to contribute to an existing body of literature while leveraging a sufficient volume of data to use modern text mining techniques.

Our main contribution is to demonstrate that machine learning techniques can efficiently use high-level descriptions to correctly infer whether a fanfiction is popular four out of five times. Our demonstration rests on three consecutive steps: scraping stories and their metadata (as in previous studies), performing feature engineering to add 24 features (e.g., number of characters, main characters, tone analysis for the two main protagonists) via Watson NLP and Google’s Natural Language API, and thoroughly optimizing a variety of classifiers (e.g., support vector machines with four types of kernels). In Sect. 2, we provide a succinct background on NLP techniques for fanfictions. The details of our three steps are provided in Sect. 3, culminating in the results shown in Sect. 4. Finally, the implications both for NLP research and fanfiction scholarship are discussed in Sect. 5.

2 Background: natural language processing and fanfiction

If the volume of data only consists of a few dozen fanfictions, then experts can manually perform an accurate thematic analysis (Table 1, bottom three rows). However, as the volume rises to thousands and even millions of stories, researchers have to accept a loss in accuracy in exchange for the ability to perform an analysis at scale. This is the classic trade-off between accuracy and volume encountered for Natural Language Processing across application fields (Galgoczy et al. 2022). Network analyses, sentiment analyses, and thematic analyses are common tools of the trade in NLP research as they serve to link entities (e.g., individuals, places, events), assess the tone of a text, and track its subject; all three analyses can be performed in a single study (Sandhu et al. 2019). In the case of fanfiction, these tools have served to address five questions, which were evoked in the introduction and are elaborated upon here.

Fanfiction websites are not solely about the stories created by individuals. They are also about a community, where readers and authors provide feedback to improve an author’s writing (Frens et al. 2018; Stenger 2021). Since fans interact because of a shared interest, fanfiction communities are “prototypical examples of online affinity spaces and networks” (Cheng and Frens 2022). Fanfiction websites are thus analyzed with respect to both text and community interactions (Frens et al. 2018; Kleindienst and Schmidt 2020). To automate this analysis, researchers have used the Measure of Textual Lexical Diversity (MTLD) at the level of chapters within each story and found that the MTLD increased with the number of reviews received (Frens et al. 2018). The idea that writers help each other as a community received further evidence in a follow-up study (Froelich et al. 2021). The authors trained a BERT classifier to recognize reviews that provided specific rather than generic feedback, and they found a moderate (albeit statistically significant) correlation between giving and receiving constructive feedback. The structure of the network inferred from relationships between fanfiction readers and authors was also analyzed in a dedicated study (Davis 2021) that leveraged 16 years of data from Fanfiction.net (28 million chapters, 177 million reviews, 10 million people). Collectively, these studies demonstrate how the topic of reviewing fanfiction combines tools of the trade such as network analysis and classifiers.
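To illustrate the measure, MTLD counts how many contiguous ‘factors’ (stretches of text whose type-token ratio stays above a threshold, conventionally 0.72) fit into a text: the fewer factors needed for a given length, the richer the vocabulary. The sketch below is a simplified, forward-only pass under our own reading of the measure; the exact implementation used by Frens et al. is not given here.

```python
def mtld(tokens, threshold=0.72):
    """Forward-pass Measure of Textual Lexical Diversity (simplified sketch).
    Counts how many 'factors' (stretches keeping the type-token ratio above
    the threshold) fit in the token list; MTLD = tokens per factor."""
    factors, types, count, ttr = 0, set(), 0, 1.0
    for token in tokens:
        types.add(token)
        count += 1
        ttr = len(types) / count
        if ttr <= threshold:       # factor complete: reset the running stretch
            factors += 1
            types, count, ttr = set(), 0, 1.0
    # Partial credit for the leftover stretch, per McCarthy & Jarvis
    partial = (1 - ttr) / (1 - threshold) if count else 0.0
    total = factors + partial
    return len(tokens) / total if total else float(len(tokens))
```

Note that the published measure averages a forward and a backward pass; this one-directional version only conveys the idea.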

Table 1 Overview of studies on fanfiction, starting with NLP techniques and contrasting them with manual techniques in the bottom three rows

Characters are central in stories, but automatically tracking them can be arduous because they can be designated through multiple words (e.g., ‘Cynthia’, ‘She’, ‘The sorceress’). A co-reference resolution system identifies characters across multiple words, thus enabling greater textual and character analysis. Several tools have been proposed for co-reference resolution, such as Yang’s use of neural networks (LSTM) trained on Jane Austen’s Sense and Sensibility (Yang 2022), or FantasyCoref, trained on Grimm’s Fairy Tales, Alice’s Adventures in Wonderland, and two stories from the Arabian Nights (Han et al. 2021). In the case of fanfiction, the specialized tool is FanfictionNLP (Yoder et al. 2021). The tool can be used to extract and attribute quotes to the right characters, thus enabling studies on how specific characters express themselves throughout a story. It can also serve to create a character network, where nodes represent the characters and edges denote relationships between characters (Schmidt et al. 2022; Labatut and Bost 2019). Such a network can show the most frequent characters and types of relationships (e.g., male–male, female–male) (Schmidt et al. 2022). In another instance, the authors used the network’s signature (e.g., eigenvectors) to infer the genre of the story (Agarwal et al. 2021).
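Once mentions are resolved to canonical characters, assembling a character network is mechanical. The sketch below is a naive illustration that matches aliases by substring and counts sentence-level co-occurrences; a real pipeline such as FanfictionNLP resolves co-references far more robustly, and the aliases shown are hypothetical.

```python
from collections import Counter
from itertools import combinations

def character_network(sentences, aliases):
    """Count character mentions (nodes) and sentence-level co-occurrences
    (edges). `aliases` maps surface forms to a canonical character name,
    a crude stand-in for co-reference resolution."""
    nodes, edges = Counter(), Counter()
    for sent in sentences:
        low = sent.lower()
        present = {canon for alias, canon in aliases.items() if alias.lower() in low}
        nodes.update(present)
        for pair in combinations(sorted(present), 2):
            edges[pair] += 1
    return nodes, edges
```

The resulting edge counts can then be loaded into a graph library to compute signatures such as eigenvector centrality, as in the genre-inference study cited above.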

In addition to supporting the aforementioned lines of inquiry, classifiers have been used to address many other questions in fanfiction. Classifiers support sentiment analyses, which can be performed either on the story (Kim and Klinger 2019) or on the reviews (Milli and Bamman 2016) to analyze the readers’ emotional response. Trigger warnings (e.g., sexual content, physical or verbal violence) have also been automatically assigned to stories by training BERT and a Support Vector Machine (Wolska et al. 2022). Researchers have also created classifiers to predict future spells in Harry Potter fanfictions (Vilares et al. 2019), which potentially enables action models for other fandoms or action types.

3 Methods

Our workflow is detailed in the next subsections, following the order summarized in Fig. 1.

Fig. 1
figure 1

Key steps of our process. This high-resolution figure can be zoomed in for additional detail

For transparency, our scripts, curated dataset, and complete results are accessible without registration on a permanent storage in a third-party repository at https://osf.io/g3p7a/.

3.1 Data collection

Our focus is to collect fanfiction about Supernatural in English. Given this language constraint, we cannot tap into the vast Internet literature in other languages such as Chinese, where the largest platform (Cloudary Corporation) officially reports 10 million registered users per day (Lu 2016). Fanfictions in English can be found on several websites (Table 2). While Wattpad has been the subject of prior studies on fanfictions, these studies are either qualitative (Budiarto et al. 2021) or use small sample sizes of a few thousand stories (Pianzola et al. 2020). The two main sources by volume are AO3 and Fanfiction.net. This aligns with prior work showing that AO3 is by far the main source and has been rising (McCullough 2023), while Fanfiction.net is a secondary source with a declining market share (Fiesler and Dym 2020). Prior works have used either of these two sources (Table 1), as their terms of service are compatible with the use of computer programs to automatically download content (i.e., data scraping).

Table 2 Volume of fanfiction stories on Supernatural (in thousands) per website. We used the two main sources (AO3 and Fanfiction.net) and a sample of 79,288 stories

In October 2022, we collected Supernatural stories and metadata from both AO3 and Fanfiction.net to achieve a diverse corpus. We used the Python scraper for AO3 created by Jingyi Li and Sarah Sterman (Li et al. 2017). The scraper places at most one request every five seconds. Fanfiction.net uses Cloudflare, which tends to detect web scrapers as hostile bots. We thus used a webdriver (Selenium) to automate the requests. To avoid creating an undue load on either website and remain within the terms of fair use,Footnote 1 we scraped 100,310 stories from Fanfiction.net and 72,300 from AO3. The scrapers collected metadata alongside each story. Seven features were obtained for both websites: a unique story ID, the title, URL, rating,Footnote 2 language, publication date, and word count. On AO3, we also obtained the number of ‘kudos’Footnote 3 (for popularity), the number of views, and the number of chapters. On Fanfiction.net, we obtained the number of ‘favs’ (for popularity), the number of followers and reviews, the author, and the genre. As exemplified in log-log plots (Fig. 2), there is significant variance among the stories that we collected with respect to the attention that they receive and their length. In Fig. 3, we also note a small correlation between the attention that a story receives (measured by the number of reviews) and the extent to which readers endorse it (as measured by ‘likes’ or ‘favs’). In the context of online shopping, popular products are defined as “products with many reviews” (Heck et al. 2020), hence we also expect a correlation between measures of popularity in our context.
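The pacing policy described above (at most one request every five seconds) can be expressed as a small rate limiter. This is an illustrative stdlib sketch, not the throttling code of the scrapers we actually used:

```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive requests, mirroring
    the one-request-per-five-seconds policy of the AO3 scraper."""
    def __init__(self, min_interval=5.0):
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval seconds have elapsed
        since the previous call, then record the current time."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

A scraping loop would simply call `limiter.wait()` before each page fetch, keeping the load on the host bounded regardless of how fast pages are parsed.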

Fig. 2
figure 2

The distribution of the number of reviews per story (a): the Y-axis shows the number of reviews gathered, and the X-axis shows the number of stories that have gathered this many reviews. The histogram of the number of words per story (b) shows the distribution of story lengths in words: the Y-axis is the specific length and the X-axis is the number of stories. There is a noticeable difference between stories with respect to the attention that they receive (a) and their length (b), which in turn constrains other features such as the number of characters, places, or dialogs. These heavy-tailed distributions are shown as log-log plots

Fig. 3
figure 3

Correlation between two measures of popularity: the number of reviews (X-axis) and the endorsements (likes/favs)

3.2 Feature engineering

When using traditional classification methods (detailed in the next subsection), metadata is useful but not sufficient to accurately characterize why a story is popular. We thus need to perform feature engineering to extract additional (potentially) informative features. As emphasized by Minaee and colleagues, this “reliance on the hand-crafted features requires tedious feature engineering and analysis to obtain good performance” (Minaee et al. 2021). Indeed, Sect. 2 showed that many options are available, from sentiment to character networks. It is thus common to start by becoming acquainted with the corpus, which informs the choice and configuration of tools for the ensuing automatic analysis. Four readers independently examined three stories each, with a minimum of 1,000 words, and then collectively synthesized characteristics of stories that readers appeared to like. These characteristics were related to whether the story had a summary, a disclaimer, or author notes; the number of locations, characters, and dialogs; the main character, overall emotion of the storyFootnote 4, lexical diversity (i.e., number of unique words) and average word length, and sentiments associated with the two key protagonists of the TV show (Sam and Dean Winchester).
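Some of these characteristics are straightforward to compute. As a minimal sketch (the tokenization choices here are illustrative, not those of Watson NLP or Google’s API), lexical diversity and average word length can be derived as follows:

```python
import re

def surface_features(text):
    """Compute two of the engineered features named above: lexical
    diversity (number of unique words) and average word length."""
    words = re.findall(r"[A-Za-z']+", text.lower())  # naive tokenizer
    unique = len(set(words))
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {"lexical_diversity": unique, "avg_word_length": avg_len}
```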

We captured sentiments through seven dimensions for each protagonist (excitement, satisfaction, politeness and impoliteness, sympathy, frustration, sadness). The distributions of sentiments across stories were similar for the two protagonists and centered on four positive dimensions (excited, satisfied, polite, sympathetic), while the remaining three were much less prevalent (Fig. 4). Note that these seven dimensions are not perfectly orthogonal, as evidenced by the high correlations in Fig. 5; the distributions underlying each correlation are provided online in Supplementary Figures 1 and 2. Overall, our approach added 24 engineered features to the three obtained from web scraping (Table 3). We computed the engineered features using Python libraries including Watson NLP from IBM and the Natural Language API from Google. These services start to incur a significant cost at a large scale, hence we created engineered features for a sample of 79,288 stories given a target budget of \(\$5,300\).

Two of the features contain categorical data: the genre provided by the author, and the main character that we detected via NLP. Machine learning algorithms commonly require categorical data to be turned into numerical data. We adopt a common machine learning approach that utilizes “a one-hot encoding technique to convert string labels to numerical labels” (Wanda and Jie 2021). If a given feature has N categorical values, then it is replaced by N binary features, which indicate the absence (0) or presence (1) of each possible value.
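For illustration, one-hot encoding can be sketched in a few lines of pure Python (library implementations such as scikit-learn’s OneHotEncoder would typically be used in practice; the feature values below are hypothetical):

```python
def one_hot(rows, column):
    """Replace one categorical column with N binary columns, one per
    distinct value, as described above: 0 = absent, 1 = present."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for v in values:
            new_row[f"{column}={v}"] = 1 if row[column] == v else 0
        encoded.append(new_row)
    return encoded
```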

Table 3 Four features were obtained from the metadata during web scraping. We engineered 24 additional features. We used one-hot encoding for the two categorical features (genre, main character)
Fig. 4
figure 4

Normalized distributions of sentiments. Each sentiment is reported for Dean (left) and Sam (right). The two sentiments not included (impolite, sad) had means lower than 0.025

Fig. 5
figure 5

Pearson correlation between the seven dimensions of sentiments for the two lead characters, Dean and Sam Winchester

We also created a new binary class attribute, whose value (whether a fanfiction is popular or not) is the target of the classification process detailed in the next subsection. ‘Success’ is a fuzzy construct with a subjective interpretation, just like being ‘rich’ or ‘tall’. We set the threshold for a popular story so that the dataset is about evenly split, thus avoiding the effects of data imbalance that would be caused by other thresholds. As a result, a ‘successful’ story must be liked by at least ten people (i.e., ten or more ‘kudos’ on AO3 or ‘favs’ on Fanfiction.net), whereas the other half of the stories are ‘unsuccessful’ because they have fewer than ten likes.
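The labeling step can be sketched as follows; the median gives a threshold that splits a corpus about evenly, and the field name `likes` is a hypothetical stand-in for kudos/favs:

```python
import statistics

def balancing_threshold(likes):
    """A threshold near the median of the likes splits the corpus about
    evenly, avoiding class imbalance."""
    return statistics.median(likes)

def label_popularity(stories, threshold=10):
    """Binary target: 1 ('popular') if a story's likes (kudos on AO3,
    favs on Fanfiction.net) reach the threshold, else 0."""
    return [1 if s["likes"] >= threshold else 0 for s in stories]
```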

3.3 Classifiers and hyper-parameter optimization

A classifier is a function that predicts a class given certain features. In our case, we use the features described in the previous subsection to predict whether a fanfiction is popular, hence we perform a binary classification. A longstanding practice in text classification is to use different algorithms to train classifiers (Kadhim 2019), in order to identify the right type of function based on performance. For example, some algorithms are well-suited when the data is linearly separable while others are able to make nonlinear cuts (Fig. 6). In addition, certain algorithms specialize in massive amounts of data or in specific data types (e.g., images). Our data consists of 79,288 rows and 25 columns (24 features and 1 class outcome) structured in a tabular format. Since the complexity and the volume of the data are not aligned with deep learning, we focus on a classic approach and employ two of the most commonly used methods for text classification (Aggarwal and Zain 2012): decision trees and support vector machines. Together, these methods cover both linear and nonlinear function hypotheses (Crutzen and Giabbanelli 2014). Decision trees work well with linear decision boundaries (Kowsari et al. 2019) and they can be trained quickly. Support Vector Machines (SVMs) have been regularly employed for classification with text as they can handle nonlinear cases using the ‘kernel trick’ (Fig. 6-right); they resemble a logistic regression when using a linear separation. SVMs are among the most resource-intensive models to train (Kadhim 2019), second only to deep learning models. In order to provide a comprehensive set of baseline algorithms for comparison, we also include random forests (i.e., sets of decision trees), logistic regression, and a neural network with 8 layers.

Fig. 6
figure 6

A decision tree makes recursive axis-parallel cuts through the data (left). Data can be rotated prior to starting the cuts, particularly if oblique cuts were needed. A support vector machine instead separates the data with a single cut, which need not be axis-parallel. When data is not linearly separable, an SVM can use the ‘kernel trick’ by augmenting the number of features so that data becomes linearly separable in a higher dimensional space (right). This illustration focuses on concepts and neither uses real data nor claims to be a mathematically accurate representation of a polynomial kernel

For the decision tree, we optimized two hyper-parameters that limit the cuts that can be made and hence force a simplification of the model. Setting a maximum depth to the tree prevents too many successive cuts, which can arise when the algorithm attempts to isolate a few points and hence causes an overfit. When a decision tree makes a cut, it intuitively divides the data into ‘left’ and ‘right’ sides, where different features can be selected for the next cuts. The number of cuts available to a tree of maximum depth d thus scales with \(2^d\): there are at most \(2^d - 1\) decision nodes. If we want each of our 24 features to potentially be used at least once, then we need \(2^d - 1 \ge 24\), hence we can pick \(d=5\). If we want to over-provision and potentially use each feature three times, then \(d=7\) would suffice since \(2^7 - 1 \ge 72\). We thus considered three values of the maximum depth to force a simplification, use each feature once, or over-provision. Raising the minimum number of samples to split also avoids creating cuts in areas that lack data. The impact on the tree was discussed in Rosso and Giabbanelli (2018). For a support vector machine, choosing the right kernel is a notoriously difficult problem (Kowsari et al. 2019), hence we considered four types of kernels. The linear kernel is most useful when data can be linearly separated, which particularly applies to text (Pillutla et al. 2020). We employed the Gaussian Radial Basis Function (RBF) kernel as it is a common alternative to a linear kernel and the most widely used kernel relying on an exponential (alternatives include the Laplace RBF kernel). We chose the sigmoid kernel as a proxy to small neural networks (i.e., a two-layer perceptron), where other options include the hyperbolic tangent kernel. Although polynomial kernels have known limitations (Steinwart 2001), we included them since they have been used in prior works on text classification (Kalcheva et al. 2020).
For each of these four kernels, we optimized multiple kernel-dependent hyper-parameters. Since training an SVM is computationally and memory expensive, a comprehensive optimization process could quickly become prohibitive and force us to use heuristics (Dudzik et al. 2021). We thus focused on binary parameters and limited the number of levels for numerical parameters.
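The depth argument can be checked numerically. Under the reading that a binary tree of depth d contains at most \(2^d - 1\) decision nodes, the smallest sufficient depth is given by:

```python
def min_depth(n_features, uses_per_feature=1):
    """Smallest depth d such that a binary tree of depth d has enough
    internal (decision) nodes, 2**d - 1, to use each feature the
    requested number of times."""
    d = 1
    while 2 ** d - 1 < n_features * uses_per_feature:
        d += 1
    return d
```

With 24 features, this yields a depth of 5 to use each feature once and 7 to over-provision three uses per feature, matching the values considered in our grid search.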

The optimization process for the decision tree, random forest, and SVM used a grid searchFootnote 5 to consider all combinations of values listed in Table 4. Since we need to both obtain robust performance estimates and perform a grid search, we divide the data into training, testing, and validation sets. This division is conducted through a 10x10 nested cross-validation, also known as a double cross-validation. That is, the data is first split into 10 parts (known as outer folds), nine of which are used for model building and one for testing, until every part has served as the test set. For each of the ten instances of model building, that portion of the data is further divided into 10 parts (known as inner folds), nine of which serve to train the model and include a grid search, with the remaining one serving for validation. We trained the neural network with the classic Adam optimizer (a variant of gradient descent), using binary cross-entropy as the loss function. Our optimization process provides three metrics (accuracy, precision, recall) for each of the ten outer folds, which allows us to compute a confidence interval and thus estimate the robustness of the results vis-à-vis the data.
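The nested cross-validation procedure can be sketched as follows. This stdlib-only skeleton only illustrates the structure (outer folds for testing, inner folds for the grid search); in practice one would rely on a library such as scikit-learn, and the `train_fn`/`score_fn` callbacks are placeholders:

```python
import random

def folds(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]

def nested_cv(data, labels, train_fn, score_fn, grid, k_outer=10, k_inner=10):
    """Nested (double) cross-validation: the outer loop estimates
    performance on held-out folds, the inner loop selects hyper-parameters
    via grid search. `grid` is a list of parameter dicts;
    train_fn(params, X, y) returns a model, score_fn(model, X, y) a score."""
    idx = list(range(len(data)))
    random.Random(0).shuffle(idx)          # deterministic shuffle for the sketch
    outer = folds(idx, k_outer)
    pick = lambda ids, seq: [seq[i] for i in ids]
    scores = []
    for i, test_idx in enumerate(outer):
        dev_idx = [j for f in outer[:i] + outer[i + 1:] for j in f]
        best_params, best_score = None, float("-inf")
        for params in grid:                # inner loop: grid search
            inner = folds(dev_idx, k_inner)
            inner_scores = []
            for j, val_idx in enumerate(inner):
                tr_idx = [m for f in inner[:j] + inner[j + 1:] for m in f]
                model = train_fn(params, pick(tr_idx, data), pick(tr_idx, labels))
                inner_scores.append(
                    score_fn(model, pick(val_idx, data), pick(val_idx, labels)))
            mean = sum(inner_scores) / len(inner_scores)
            if mean > best_score:
                best_params, best_score = params, mean
        # retrain on the full development portion with the chosen parameters
        model = train_fn(best_params, pick(dev_idx, data), pick(dev_idx, labels))
        scores.append(score_fn(model, pick(test_idx, data), pick(test_idx, labels)))
    return scores
```

The list of ten outer-fold scores is what enables the confidence intervals reported with our metrics.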

Table 4 Values of the hyper-parameters used in our optimization by grid search. C is the regularization parameter and it affects the margin of the hyperplane (a higher C leads to a smaller margin). Gamma controls which data points influence the hyperplane (a higher gamma emphasizes points closer to the hyperplane). The logistic regression has no hyper-parameters, hence it is not subject to optimization, while the neural network uses a different optimization process

4 Results

Complete results available in our shared online repository show that insufficient performance was encountered when using Support Vector Machines with either a sigmoid kernel (precision and recall lower than 50%) or a polynomial kernel (recall at most 58.74% and accuracy at most 65.95%). While the neural network had sufficient accuracy (78.38%) and the best precision (80.55%), it was at the expense of a very low recall (68.73%). A similar situation was encountered for the logistic regression, with a decent accuracy (76%) and precision (78%) but insufficient recall (59%). The random forest performed as well as the decision tree, as detailed by our complete results on our repository. We thus focus on the satisfactory performances produced by decision trees and Support Vector Machines with linear or RBF kernels (Table 5). Every one of these approaches had its highest score for recall, which intuitively means that the models rarely miss a popular story. The SVM with RBF kernel had a commendable score for recall but underperformed the other two options by a wide margin on precision and accuracy. The best performances are obtained when using decision trees, which produce scores of approximately 80% in all categories. That is, decision trees were right four out of five times.

While the highest score in each metric may be obtained by different hyper-parameter values, it is necessary for deployment to create one model with a single set of hyper-parameter values. However, it can be arduous to find the values that provide the best performances on all metrics of interest. For example, hyper-parameter values yielding the best four performances for decision trees on recall also produced the worst four performances on precision. We recommend a decision tree with a maximum depth of 7 and minimum number of samples of 2 as it yields the best accuracy (79.51 ± 0.4), the best precision (79.02 ± 1.1), and an average recall (80.44 ± 1.2 by comparison with a minimum of 77.36 and a maximum of 83.36). These hyper-parameter values favor a tree that is able to make more cuts to isolate samples, through both its large depth and low threshold for making a cut.

Table 5 For the best three performing machine learning approaches, we report the top 3 performances with regard to each metric. Note that the top two performances for the RBF kernel had the same scores and parameters apart from shrinking. Results were averaged across the 10 outer folds

We further investigated the results obtained by our best decision tree using SHAP (SHapley Additive exPlanations) to reveal how features were related to the prediction outcomes. SHAP is a widely used tool to explain machine learning models by deconstructing their predictions into the contributions of individual features, as exemplified by recent studies using SHAP on decision trees (Rodrigo et al. 2021), including boosted trees (Nohara et al. 2022) or ensembles (Campbell et al. 2022). SHAP supports local interpretability because it helps to understand individual predictions rather than how the model works (global interpretability). The feature importance plot in Fig. 7 shows that the number of reviews and unique words are strong predictive variables, followed by stories centered on romance and comfort (as extracted from the stories’ metadata created by the authors). The emotional states of the two main characters were not significant predictors. The direction of the effect (i.e., whether values helped to predict popular or unpopular stories) is shown in Fig. 8. Although several features have a clear direction of effect, it is important to be mindful of the magnitude of this effect. For instance, readers appear to enjoy seeing the character of Sam express frustration or sadness, but either phenomenon is relatively rare within the sample.

Fig. 7
figure 7

This global feature importance plot shows by how much each variable (from most important at the top to least important at the bottom) impacts the prediction. It does not show the direction of impact, that is, whether the model predicts that a story is popular or not; directionality is provided in Fig. 8. Note that ‘comfort’, ‘hurt’, ‘supernatural’, and ‘humor’ are tags created by the authors. Emotion is the overall valence of the story

Fig. 8
figure 8

Amount and direction by which each feature affects whether an instance is predicted to be a popular story (b) or not (a). Negative values mean that the feature value led the model to say ‘no’. Cases in blue signify a higher feature value. Note that these refer to the raw feature values rather than SHAP values. For example, a story that scores high on romance is unlikely to be unpopular (a) and more likely to be popular (b). For the most part, the two plots are symmetric, with the exception of outliers (see, e.g., the isolated dots on the bottom line). These information-dense ‘beeswarm plots’ complement the higher-level summary in Fig. 7

As shown by the SHAP values (Figs. 7 and 8), the importance of the number of reviews tells us that we cannot focus exclusively on the content of a story to know whether it will be popular: we also need to know whether it attracts attention, as measured by the number of reviews. By removing this feature, we exclude popularity metrics and focus on the intrinsic information contained in a story. As summarized in Table 6, removing the number of reviews can noticeably impact performance, depending on the type of machine learning algorithm. Deep neural networks, logistic regressors, decision trees, and linear support vector machines experienced a double-digit performance loss on two or more metrics. The support vector machine with a nonlinear RBF kernel had a more moderate loss and continued to excel in terms of recall. Performance improved for models that initially performed poorly (sigmoid or polynomial kernels), but it remained lower than the alternatives. This confirms that some aspects of a story are predictive of its popularity, and that knowing other popularity measures provides even greater predictive ability.

Table 6 We optimized the models after removing the number of reviews. This loss of a key predictive feature generally translates to a loss (\(\triangledown\)) in performance. For models whose performance was already low, the removal may have produced a gain (\(\triangle\)), but the performance is still low
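The feature-ablation experiment above can be sketched as follows on synthetic data (the feature names and effect sizes are invented for illustration; our actual study re-optimized each model on the real corpus). The point is that removing a dominant proxy for the label, here a stand-in for the review count, measurably degrades a classifier that could otherwise lean on it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-ins: 'reviews' strongly tracks the label, the other
# features only weakly (illustrative names, not the paper's data).
y = rng.integers(0, 2, n)
reviews = y + rng.normal(0, 0.4, n)           # strong proxy for popularity
unique_words = 0.3 * y + rng.normal(0, 1, n)  # weak signal
romance = 0.2 * y + rng.normal(0, 1, n)       # weak signal
X = np.column_stack([reviews, unique_words, romance])

def fit_score(features):
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=0)
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

acc_full = fit_score(X)           # with the review count
acc_ablate = fit_score(X[:, 1:])  # review count removed
assert acc_full > acc_ablate      # ablation hurts when the feature dominates
```

In our study the drop varied by algorithm (Table 6); this sketch only demonstrates the mechanism of the comparison, not our measured magnitudes.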

5 Discussion

The growing volume of online fanfiction has been the subject of numerous studies, either from a text-mining perspective using Natural Language Processing or through a qualitative lens via manual examination (Table 1). We contribute to these efforts by using classifiers to determine the popularity of fanfiction stories about the show Supernatural, chosen for its large available corpus as well as extensive scholarship, ranging from articles such as Åström (2010), Flegel and Roth (2010), Tosenberger (2008), Herbig and Herrmann (2016), Zubernis (2021) to the thesis of Guirola (2023) and edited volumes by Taylor and Nylander (2019), Macklem et al. (2020), and Wilkinson (2013). We show that it is possible to correctly predict, four out of five times, whether a story is popular based on high-level features.

By using local interpretability techniques for Machine Learning (i.e., SHAP), we were able to relate specific features to the popularity of stories (Sect. 4). Our findings can be summarized in three takeaways. First, a large number of reviews is indicative of attention: popular stories amass many more reviews, which suggests similarities in taste among the readership. Second, fans tend to like longer stories and those with a wider vocabulary. We posit that it is not merely about wanting ‘more’, but an indicator of overall writing quality and effort from the writer. Third, readers enjoy romantic stories and have mixed feelings when characters get hurt (which is a frequent occurrence on the TV show). This is noteworthy because the original show Supernatural may be categorized as action, adventure, drama, fantasy, horror, or mystery—but not as a romance. Other scholars have noted that “many fans shipped the main, male characters together” despite the initially heteronormative lens of the TV show, hence “showrunners supported that interpretation by incorporating seemingly romantic glances” that were occasionally perceived as queerbaiting (Church 2023). Our large-scale analysis thus confirms the use of fanfiction to complement the show by venturing into themes that it did not extensively cover.

There are several limitations to our study. First, despite a sizable corpus of tens of thousands of stories, we do not have an extensive sample for every writing style; hence, we refrain from drawing conclusions on patterns that are visible but only appear in a few stories. For example, we observed that humor was always a success, as unpopular stories were low in humor whereas popular stories had a more marked amount of humor. The type of humor did not seem to have an impact, since the distribution of effects is very one-sided. However, these effects were based on a small sample size, hence they cannot be broadly generalized to all fanfiction. Although our study was based on the most studied TV show in fanfiction, we also note that our findings do not automatically generalize to other TV shows or to fanfiction in general. Additionally, similar results may have been obtained for a lower computing cost, as our training process was intended to be thorough out of an abundance of caution. For instance, nested cross-validation has been characterized as ‘overzealous’ at times, as its high computing cost may only provide a minor improvement in model estimates compared to the optimization procedures used by AutoML or Auto-Sklearn (Wainer and Cawley 2021). Similarly, some of the Support Vector Machine kernels (particularly the polynomial) could have been avoided; we only included them to be in line with prior works on text classification. Finally, we performed a binary classification in order to have a clear notion of popularity and balance the data, but popularity is a continuum rather than a dichotomy. An alternative would be to either perform a multiclass classification with multiple levels of popularity, or to treat popularity as a continuous attribute and opt for a regression.
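For readers unfamiliar with the procedure discussed above, nested cross-validation wraps a hyperparameter search (inner loop) inside an outer evaluation loop, so that the reported score is not inflated by tuning on the test folds. A minimal scikit-learn sketch on synthetic data (the grid and fold counts are illustrative, not our exact experimental settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: hyperparameter search; outer loop: unbiased performance estimate.
inner = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, cv=3)
scores = cross_val_score(inner, X, y, cv=5)  # nested 5x3 cross-validation
mean_score = scores.mean()
```

The cost is the product of the two loops (here 5 × 3 = 15 fits per grid point), which is why the procedure can be ‘overzealous’ relative to a single tuned split.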

Several years ago, fanfiction scholars already anticipated that AI chatbots would be involved alongside humans in the writing process (Lamerichs 2018). GPT-4 and other pre-trained large-scale language models (LLMs) have made it a reality, resulting in a rapid acceleration of AI as (co)writers for fanfictions (Rosenberg 2023). Our findings are particularly informative in this new environment by showing which features lead to popular stories, which may facilitate the (semi)automatic generation of such stories. For example, knowing that fans prefer longer stories with a wider vocabulary is helpful when prompting GPT-4 or other AI assistants. We note that these assistants are best at replicating what they have already seen, but our analysis shows that fans prefer stories to touch on themes that were not necessarily prevalent in the show. This information may lead to engineering prompts that explicitly ask to incorporate features (e.g., romance) that would not otherwise be generated solely by relying on the training data.

It is possible that a classifier leveraging LLMs would yield higher performance measures (e.g., accuracy, precision, recall). This would be a different approach, as the entire story would be encoded (e.g., with word2vec or TF-IDF) instead of extracting specific features with a known meaning. Our goal was to transparently relate features to the popularity of stories, rather than maximize performance measures. A complementary study using the latest deep learning approaches such as DeBERTaV3 (He et al. 2021) would thus be a useful follow-up. Such methods based on Deep Neural Networks represent the state of the art when the objective is to maximize a performance measure for natural language processing (Suissa et al. 2022). Since pre-trained models have been exposed to different datasets, they can encode different knowledge models and hence perform differently in a given application. A study using deep learning would thus have to employ several techniques, in the same manner as we used and optimized different algorithms. In addition, deep learning models may need fine-tuning to ensure that they are adapted to the application context, which can require human-annotated data as shown in our prior applied study with BERT (Galgoczy et al. 2022).
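The whole-story encoding contemplated above can be sketched with a TF-IDF pipeline (the story snippets and labels below are invented for illustration, not drawn from our corpus). Unlike our hand-crafted features, the classifier here consumes an opaque bag-of-words representation of the full text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (invented snippets, not real fanfiction data).
stories = [
    "Sam and Dean hunt a ghost in an abandoned asylum",
    "A slow-burn romance blooms between the brothers' allies",
    "Comfort after the hunt: bandages, blankets, and quiet words",
    "A humorous case of a trickster looping Tuesday forever",
]
labels = [0, 1, 1, 0]  # hypothetical popular (1) / unpopular (0) flags

# Whole-story encoding (TF-IDF) feeding a linear classifier, in contrast
# to the interpretable, hand-extracted features used in our study.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(stories, labels)
preds = clf.predict(stories)
```

A deep learning variant would swap the TF-IDF step for a pre-trained encoder (e.g., DeBERTaV3 embeddings), at the cost of the per-feature transparency that motivated our design.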