CHR2022 Programme

Please note that all times are in Central European Time (CET)

Monday 12 December 2022 (Day 1)

  • (15:30-16:15: walk-in, registration)
  • 16:15-16:30: welcome
  • 16:30-18:00: session 1A: keynote Peter Turchin
  • 18:00: opening reception (followed by self-paid dinner)

Tuesday 13 December 2022 (Day 2)

  • 09:00-10:30: session 2A: Language
  • 10:30-11:00: coffee break
  • 11:00-12:20: session 2B: Literature I
  • 12:30-13:30: lunch
  • 13:30-15:00: session 2C: Images and Scans
  • 15:00-15:30: coffee break
  • 15:30-16:15: session 2D: Social Media
  • 16:15-17:45: session 2E: poster pitch session, followed by walk-around
  • 19:00: conference dinner (University club)

Wednesday 14 December 2022 (Day 3)

  • 09:00-10:30: session 3A: keynote Nina Tahmasebi
  • 10:30-11:00: coffee break
  • 11:00-12:25: session 3B: Historical Dynamics
  • 12:30-13:30: lunch
  • 13:30-15:05: session 3C: Literature II
  • 15:05-15:35: coffee break
  • 15:35-17:05: session 3D: Text Classification
  • 17:05-17:30: session 3E: award ceremony, concluding remarks

Detailed Programme

Session 2A Language

  • The Roots of Doubt. Fine-tuning a BERT Model to Explore a Stylistic Phenomenon

    Margherita Parigini and Mike Kestemont

    The narrative work of well-known Italian author Italo Calvino (1923-1985) features a phenomenon that literary critics refer to as "dubitative text": this stylistic device consciously hinders the narrative progression of a story by questioning its own content. We report on an attempt to model the presence of dubitative text in Calvino's fictional oeuvre and examine whether this model can also be used to retrieve dubitative instances in his essayistic oeuvre. We hypothesize that it is precisely the category of dubitative text that yields interesting points of intersection between both writing modes. We fine-tuned a BERT model on a manually annotated dataset and report inter-annotator scores. We situate our findings and model criticism in the current landscape of Calvino scholarship. While detecting dubitative text is challenging, our model provides fresh insights into the device's surface features.
  • Linguistic value construction in 18th-century London auction advertisements: a quantitative approach

    Alessandra De Mulder, Lauren Fonteyn and Mike Kestemont

    Georgian England was characterised by a buzzing consumer society in which advertising played a progressively important role in the (linguistic) value construction surrounding material goods. Increasingly, the perceived value of goods was determined not only by their intrinsic quality, but also by the socio-commercial discourse used to characterise them. Linguistic modifiers, such as adjectives, must have played an important role in this process -- reflecting these socio-economic trends in text while also reinforcing them. Here, we focus on a diachronic corpus of over 5,000 pages of London auction advertisements, digitised via automated transcription and divided across four sample periods between 1742 and 1829. Prime methodological challenges include: (1) the noisiness of the available data because of imperfect transcription; (2) the coarseness of the available time stamps; and (3) the lack of suitable NLP software, such as lemmatizers or (shallow) syntactic parsers. Through the use of word embeddings, we try to alleviate the issue of spelling variation with reasonable success. We find that, over time, subjective or ‘evaluative’ modifiers have become more prominent in these advertisements than their objective or ‘descriptive’ counterparts -- but there are different temporal patterns for different types of advertised objects.
  • Introducing Functional Diversity: A Novel Approach to Lexical Diversity in (Historical) Corpora

    Folgert Karsdorp, Enrique Manjavacas and Lauren Fonteyn

    The question of how we can reliably estimate the lexical diversity of a particular text (collection) has often been asked by linguists and literary scholars alike. This short paper introduces a way of operationalizing functional diversity measurements by means of token-based embeddings, and argues that functional diversity is not only a practically advantageous but also a theoretically relevant addition to the Computational Humanities Research toolkit. By means of an experiment on the historical ARCHER corpus, we show that lexical diversity at the level of functional groups is less sensitive to orthographic variation, and provides insight into an important and often disregarded dimension of vocabulary diversity in textual data. (An illustrative sketch of one possible operationalisation follows this session's listing.)
  • Detecting Formulaic Language Use in Historical Administrative Corpora

    Marijn Koolen and Rik Hoekstra

    Historical administrative corpora are filled with jargon and formulaic expressions that were used consistently across many documents. Governmental decisions, notarial deeds and official charters often contain fixed expressions to ensure that the same legal aspects in different documents had the same interpretation. Such formulaic expressions can be used to identify specific elements of a document. For instance, a deed has different formulas to indicate whether it concerns the sale of property or the transferal of rights. In this paper we explore formulas as a methodological device to structure the text of an administrative corpus and make the information contained in it more accessible. We use a data-driven method to detect potential formulaic expressions in historical corpora that can deal with spelling variation and change, as well as with recognition errors introduced in the digitisation process. We apply this exploratory technique to a corpus of almost 300,000 eighteenth-century resolutions of the States General of the Dutch Republic and find many formulaic expressions that capture relationships between the political actors involved and the decisions that were made. A first analysis suggests that many formulas can be used to add metadata to individual resolutions on various elements of the proposals and decisions that are part of each resolution.
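
A minimal, illustrative Python sketch of one way the functional-diversity idea above (Karsdorp, Manjavacas and Fonteyn) could be operationalised: contextual token embeddings are clustered into 'functional groups' and a Shannon diversity index is computed over group frequencies rather than over word types. The clustering step, the number of groups and the diversity index are assumptions made for illustration, not the authors' exact procedure.

import numpy as np
from sklearn.cluster import KMeans

def functional_diversity(token_embeddings, n_groups=100):
    """Shannon diversity over 'functional groups' of tokens.

    token_embeddings: array of shape (n_tokens, dim), one contextual
    embedding per running token in the text (assumed to come from any
    token-based embedding model).
    n_groups: number of functional groups to induce (illustrative value).
    """
    labels = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(token_embeddings)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

Because spelling variants of a word tend to receive similar embeddings, they typically end up in the same group, which suggests why a group-level measure can be less sensitive to orthographic variation than a plain type-token ratio.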

Session 2B Literature I

  • ‘Entrez!’ she called: Evaluating Language Identification Tools in English Literary Texts

    Erik Ketzan and Nicolas Werner

    This short paper presents work in progress on the evaluation of current language identification (LI) tools for identifying foreign-language n-grams in English-language literary texts, for instance, “‘Entrez!’ she called”. We first manually annotated French and Spanish words appearing in 12,000-word text samples by F. Scott Fitzgerald and Ernest Hemingway using a TEI tag. We then split the tagged sample texts into four groups of n-grams, from unigrams to tetragrams, and compared the accuracy of five LI packages at correctly identifying the language of the tagged foreign-language snippets. We report that, of the packages tested, fastText proved most accurate for this task overall, but that methodological questions and future work remain.
  • Correlations between GoodReads Appreciation and the Sentiment Arc Fractality of the Grimm brothers' Fairy Tales

    Yuri Bizzoni, Mads Rosendahl Thomsen, Ida Marie S. Lassen and Kristoffer Nielbo

    Despite their widespread popularity, fairy tales are often overlooked when studying literary quality with quantitative approaches. We present a study on the relation between sentiment fractality and literary appreciation by testing the hypothesis that fairy tales with a good balance between unpredictability and excessive self-similarity in their sentiment narrative arcs tend to be more popular and more appreciated by audiences of readers. In short, we perform a correlation study of the degree of fractality of the Grimm brothers' fairy tales and their current appreciation as measured by their Goodreads scores. Moreover, we look at the popularity of these fairy tales through time, determining which ones have come to form a strong "internal canon" in the corpus of the authors and which ones have fallen into relative obscurity. (An illustrative sketch of one fractality estimator follows this session's listing.)
  • One Graph to Rule them All: Using NLP and Graph Neural Networks to analyse Tolkien's Legendarium

    Vincenzo Perri, Lisi Qarkaxhija, Albin Zehe, Andreas Hotho and Ingo Scholtes

    Natural Language Processing and Machine Learning have considerably advanced Computational Literary Studies. Similarly, the construction of co-occurrence networks of literary characters, and their analysis using methods from social network analysis and network science, have provided insights into the micro- and macro-level structure of literary texts. Combining these perspectives, in this work we study character networks extracted from a text corpus of J.R.R. Tolkien's Legendarium. We show that this perspective helps us to analyse and visualise the narrative style that characterises Tolkien's works. Addressing character classification, embedding and co-occurrence prediction, we further investigate the advantages of state-of-the-art Graph Neural Networks over a popular word embedding method. Our results highlight the large potential of graph learning in Computational Literary Studies.
  • The Process of Imitatio Through Stylometric Analysis: the Case of Terence’s Eunuchus

    Andrea Peverelli, Marieke van Erp and Jan Bloemendal

    The Early Modern era saw a widespread enthusiasm for Latin works: texts from classical antiquity were given new life, widely re-printed, studied and, in the case of dramas, even repeatedly staged throughout Europe. New Latin comedies were also written in quantities never seen before (at least 10,000 works published between 1500 and 1800 are known). The authors themselves, within the game of literary imitation (the process of imitatio), started to mimic the style of ancient authors, and Terence’s dramas in particular were considered the prime sources of reuse for many decades. Via a case study, the reception of Terence’s Eunuchus in Early Modern literature, we take a deep dive into the mechanisms of literary imitation. Our analysis is based on four comedy corpora in Latin, Italian, French and English, spanning roughly three centuries (1400-1700). To address the problem of language shift and multi-language inter-corpora analysis, we base our experiments on translations of the Eunuchus, one for each sub-corpus. Using tools drawn from the field of Stylometry, we address the topic of text reuse and textual similarities between Terence’s text and the Early Modern corpora to get a better grasp of the internal fluctuations of the imitation game between Early Modern and Classical authors.
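
A minimal sketch related to the fractality study above (Bizzoni et al.): a rough Hurst-exponent estimate for a sentiment arc, followed by a rank correlation with appreciation scores. The estimator, the lag range and the variable names are assumptions for illustration; the paper may use a different fractal measure.

import numpy as np
from scipy.stats import spearmanr

def hurst_exponent(arc, max_lag=50):
    """Rough Hurst-exponent estimate from the scaling of the standard
    deviation of lagged differences of a sentiment arc (1-D series)."""
    arc = np.asarray(arc, dtype=float)
    lags = np.arange(2, max_lag)
    tau = [np.std(arc[lag:] - arc[:-lag]) for lag in lags]
    # slope of log(std) against log(lag) approximates H
    return np.polyfit(np.log(lags), np.log(tau), 1)[0]

# Hypothetical usage: one sentiment arc per fairy tale plus its Goodreads rating.
# h = [hurst_exponent(arc) for arc in sentiment_arcs]
# rho, p = spearmanr(h, goodreads_ratings)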

Session 2C Images and Scans

  • Ground-truth Free Evaluation of HTR on Old French and Latin Medieval Literary Manuscripts

    Thibault Clérice

    As more and more projects openly release ground truth for handwritten text recognition (HTR), we expect the quality of automatic transcription to improve on unseen data. Getting models robust to scribal and material changes is a necessary step for specific data-mining tasks. However, evaluation of HTR results requires ground truth to compare predictions statistically. In the context of modern languages, successful attempts to evaluate quality have been made using lexical features or n-grams. This, however, proves difficult in the context of the spelling variation that both Old French and Latin exhibit, even more so for sometimes heavily abbreviated manuscripts. We propose a new method based on deep learning in which we attempt to categorize each line's error rate into four ranges (0-10%, 10-25%, 25-50%, 50-100%) using three different encoders (GRU with attention, BiLSTM, TextCNN). To train these models, we propose a new dataset engineering approach using early-stopped models as an alternative to rule-based fake predictions. Our model largely outperforms the n-gram approach. We also provide an example application to qualitatively analyse our classifier, using its classifications of new predictions on a sample of 1,800 manuscripts ranging from the 9th to the 15th century.
  • The Computational Memorability of Iconic Images

    Lisa Saleh and Nanne van Noord

    The perception of historic events is frequently shaped by specific images that have been ascribed an iconic status. These images are widely reproduced and recognised and can therefore be considered memorable. A question that arises given such images is whether the memorability of iconic images is intrinsic or whether it is shaped. In this work we analyse the memorability of iconic images by means of computational techniques that are specifically designed to measure the intrinsic memorability of images. To judge whether iconic images are inherently more memorable, we establish two baselines based on datasets of diverse imagery and of newspaper imagery. Our findings show that iconic images are not more memorable than modern-day newspaper imagery or than a diverse set of everyday images. In fact, by and large many of the iconic images analysed score on the low end of the memorability spectrum. Additionally, we explore the variation in memorability of reproductions of iconic images and find that certain images have been edited in ways that result in higher memorability scores, but that reproductions by and large stay close to the memorability of the original.
  • Page Layout Analysis of Text-heavy Historical Documents: a Comparison of Textual and Visual Approaches

    Sven Najem-Meyer and Matteo Romanello

    Page layout analysis is a fundamental step in document processing, as it makes it possible to segment a page into regions of interest. With highly complex layouts and mixed scripts, scholarly commentaries are text-heavy documents which remain challenging for state-of-the-art models. Their layout varies considerably across editions and their most important regions are mainly defined by semantic rather than graphical characteristics such as position or appearance. This setting calls for a comparison between textual, visual and hybrid approaches. We therefore assess the performance of two transformers (LayoutLMv3 and RoBERTa) and an object-detection network (YOLOv5). While the results show a clear advantage in favor of the latter, we also list several caveats to this finding. In addition to our experiments, we release a dataset of ca. 300 annotated pages sampled from 19th-century commentaries.
  • Automatic Identification and Classification of Portraits in a Corpus of Historical Photographs

    Taylor Arnold, Lauren Tilton and Justin Wigard

    There have been recent calls for an increased focus on the application of computer vision to the study and curation of digitised cultural heritage materials. In this short paper, we present an approach to bridge the gap between existing algorithms and humanistically driven annotations through a case study in which we create an algorithm to detect and classify portrait photography. We apply this method to a collection of about 40,000 photographs and present a preliminary analysis of the constructed data. The work is part of a larger ongoing study that applies computer vision to the computational analysis of over a million U.S. documentary photographs from the early twentieth century.

Session 2D Social Media

  • Emodynamics: Detecting and Characterizing Pandemic Sentiment Change Points on Danish Twitter

    Rebekah Baglini, Sara Møller Østergaard, Stine Nyhus Larsen and Kristoffer Nielbo

    In this paper, we present the results of an initial experiment using emotion classifications as the basis for studying information dynamics in social media (‘emodynamics’). To do this, we used BERT Emotion to assign probability scores for eight different emotions to each text in a time series of 43 million Danish tweets from 2019-2022. We find that variance in the information signals novelty and resonance reliably identifies seasonal shifts in posting behavior, particularly around the Christmas holiday season, whereas variance in the distribution of emotion scores corresponds to more local events such as major inflection points in the Covid-19 pandemic in Denmark. This work in progress suggests that emotion scores are a useful tool for diagnosing shifts in the baseline information state of social media platforms such as Twitter, and for understanding how social media systems respond to both predictable and unexpected external events. (An illustrative sketch of one windowed relative-entropy formulation of novelty and resonance follows this session's listing.)
  • Differentiating Social Media Texts via Clustering

    Hannah Seemann and Tatjana Scheffler

    We propose to use clustering of documents based on their fine-grained linguistic properties in order to capture and validate text type distinctions such as medium and register. Correlating the bottom-up, linguistic feature driven clustering with text type distinctions (medium and register) enables us to quantify the influence of individual author choice and medium/register conventions on variable linguistic phenomena. Our pilot study applies the method to German particles and intensifiers in a multimedia corpus, annotated for register. We show that German particles and intensifiers differ across both register and medium. The clustering based on the linguistic features most closely corresponds to the medium distinction, while the stratification into registers is reflected to a lesser extent.
  • Right-wing Mnemonics

    Phillip Stenmann Baun and Kristoffer Nielbo

    This paper presents a natural language processing technique for studying memory on the far-right political discussion forum /pol/ on 4chan.org. Memory and the use of history play a pivotal role on the far-right for temporally structuring beliefs about social life and order. However, due in part to methodological limitations, there is a lack of knowledge regarding the specific historical entities that make up the far-right memory culture and wider historiography. To better grasp the structure of far-right memory, this paper opts for a data-intensive methodology, using machine learning on a data set of approximately 66 million posts from /pol/ from 2020. 19,821 random posts were manually annotated according to the presence of historical entities. After evaluating interrater reliability, the data were used to train a naïve Bayes text classifier to learn the lexical features of so-called "posts of memory" (POMs). After parameter tuning, the model extracted from the dataset a total of 1,083,471 POMs with a precision score of 98.43%. It is argued that this technique provides a novel way to automate the identification of historical entities within far-right authored text, to the benefit of memory studies and far-right studies, two fields that have traditionally relied on more qualitative close-reading approaches. By investigating the mnemonic features of the /pol/ posts during steps in the methodological pipeline, the paper contributes important insights into the challenges of identifying and classifying lexical features in hyper-vernacular digital spaces like 4chan, where communication is highly defined by intertextuality, semantic ambiguity, and cacography.
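
For the emodynamics paper above (Baglini et al.), one common windowed relative-entropy formulation of novelty, transience and resonance (in the spirit of Barron et al., 2018) can be sketched as follows. The window size, the aggregation into per-step emotion distributions and this exact formulation are assumptions; the abstract does not spell out the computation.

import numpy as np
from scipy.stats import entropy  # entropy(p, q) returns the KL divergence D(p || q)

def novelty_resonance(P, w=7):
    """P: array of shape (T, k), each row a probability distribution over k
    emotions (e.g. a daily mean of per-tweet emotion scores, a hypothetical
    aggregation). Returns novelty, transience and resonance per time step."""
    T = len(P)
    novelty = np.full(T, np.nan)
    transience = np.full(T, np.nan)
    for t in range(w, T - w):
        novelty[t] = np.mean([entropy(P[t], P[t - d]) for d in range(1, w + 1)])
        transience[t] = np.mean([entropy(P[t], P[t + d]) for d in range(1, w + 1)])
    resonance = novelty - transience
    return novelty, transience, resonance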

Session 3B Historical Dynamics

  • Lost Manuscripts and Extinct Texts: A Dynamic Model of Cultural Transmission

    Jean-Baptiste Camps and Julien Randon-Furling

    How did written works evolve, disappear or survive down through the ages? In this paper, we propose a unified, formal framework for two fundamental questions in the study of the transmission of texts: how much was lost or preserved from all works of the past, and why do their genealogies (their "phylogenetic trees") present the very peculiar shapes that we observe or, more precisely, reconstruct? We argue here that these questions share similarities with those encountered in evolutionary biology, and can be described in terms of "genetic" drift and "natural" selection. Through agent-based models, we show that such properties as have been observed by philologists since the 1800s can be simulated and confronted with data gathered for ancient and medieval texts across Europe, in order to obtain plausible estimates of the number of works and manuscripts that existed and were lost.
  • What Shall We Do With the Unseen Sailor? Estimating the Size of the Dutch East India Company Using an Unseen Species Model

    Melvin Wevers, Folgert Karsdorp and Jelle van Lottum

    Historians base their inquiries on the sources that are available to them. However, not all sources that are relevant to the historian's inquiry may have survived the test of time. Consequently, the resulting data can be biased in unknown ways, possibly skewing analyses. This paper deals with the Dutch East India Company's digitized ledgers of contracts. We apply an unseen species model, a method from ecology, to estimate the actual number of unique seafarers contracted. We find that the lower bound of actual seafarers is much higher than what the remaining contracts indicate: at least thirty-six percent of the seafarers are unknown. Moreover, we find that even in periods when few records survived, we can still credibly estimate a lower bound on the number of unique seafarers. (An illustrative sketch of one such lower-bound estimator follows this session's listing.)
  • Chronicling Crises: Event Detection in Early Modern Chronicles from the Low Countries

    Alie Lassche, Jan Kostkan and Kristoffer Nielbo

    Between the Middle Ages and the nineteenth century, many middle-class Europeans kept a handwritten chronicle, in which they reported on events they considered relevant. Discussed topics varied from records of price fluctuations to local politics, and from weather reports to remarkable gossip. What we do not yet know is to what extent times of conflict and crisis influenced the way in which people dealt with information. We have applied methods from information theory -- dynamics in word usage and measures of relative entropy such as novelty and resonance -- to a corpus of early modern chronicles from the Low Countries (1500-1820) to provide more insight into the way early modern people coped with information during impactful events. We detect three peaks in the novelty signal, which coincide with times of political uncertainty in the Northern and Southern Netherlands. Topic distributions provided by Top2Vec show that during these times, chroniclers tend to write more, and more extensively, about an increased variety of topics.
  • Measuring Rhythm Regularity in Verse: Entropy of Inter-Stress Intervals

    Artjoms Šeļa and Mikhail Gronas

    Recognition of poetic meters is not a trivial task, since metrical labels are not a closed set of classes. Outside of classical meters, describing the metrical structure of a poem in a large corpus requires expertise and a shared scientific theory. In a situation where both components are lacking, alternative, continuous measures of regularity can be envisioned. This paper focuses on poetic rhythm and proposes a simple entropy-based measure of poem regularity using counts of non-stressed intervals. The measure is validated using subsets of a well-annotated Russian poetic corpus, prose, and quasi-poems (prose chopped into lines). The regularity measure is able to detect a clear difference between various organizational principles of texts: average entropy rises when moving from accentual-syllabic meters to accentual variations to free verse and prose. Interval probabilities, when taken as a vector of features, also allow for classification at the level of individual poems. This paper argues that distinguishing between meter as a cultural idea and rhythm as an empirical sequence of sounds can lead to a better understanding of form recognition and prosodic annotation problems. (An illustrative sketch of such an interval-entropy measure follows this session's listing.)
  • Detecting Sequential Genre Change in Eighteenth-Century Texts

    Jinbin Zhang, Yann Ciarán Ryan, Iiro Rastas, Filip Ginter, Mikko Tolonen and Rohit Babbar

    Machine classification of historical books into genres is a common task for NLP-based classifiers and has a number of applications, from literary analysis to information retrieval. However, it is not a straightforward task, as genre labels can be ambiguous and subject to temporal change, and moreover many books consist of mixed or miscellaneous genres. In this paper we describe a work-in-progress method by which genre predictions can be used to determine longer sequences of genre change within books, which we test with visualisations of some hand-picked texts. We apply state-of-the-art methods to the task, including a BERT-based transformer and a character-level Perceiver model, both pre-trained on a large collection of eighteenth-century works (ECCO), using a new set of hand-annotated documents created to reflect historical divisions. Results show that both models perform significantly better than a linear baseline, particularly when ECCO-BERT is combined with tf-idf features, though for this task the character-level model provides no obvious advantage. Initial evaluation of the genre sequence method shows it may in the future be useful for determining and dividing the multiple genres of miscellaneous and hybrid historical texts.
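
A minimal sketch of an unseen species estimate of the kind used by Wevers, Karsdorp and van Lottum above: the Chao1 lower bound on the number of unique seafarers, computed from how often each observed individual appears in the surviving contracts. Chao1 is one standard lower-bound estimator; the abstract does not state which variant the authors use, so treat this as illustrative.

from collections import Counter

def chao1_lower_bound(records_per_individual):
    """Chao1 lower bound on the true number of classes (unique seafarers).

    records_per_individual: one integer per *observed* individual, giving how
    many surviving contracts mention that individual (hypothetical input)."""
    freq_of_freqs = Counter(records_per_individual)
    s_obs = len(records_per_individual)      # individuals actually observed
    f1 = freq_of_freqs.get(1, 0)             # observed exactly once
    f2 = freq_of_freqs.get(2, 0)             # observed exactly twice
    if f2 > 0:
        return s_obs + f1 ** 2 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2         # bias-corrected form when f2 == 0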
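
For the rhythm-regularity paper by Šeļa and Gronas above, a minimal sketch of an entropy over inter-stress intervals. The binary stress encoding and the base-2 logarithm are assumptions for illustration.

import math
from collections import Counter

def interstress_interval_entropy(stress_pattern):
    """Shannon entropy (in bits) of the distribution of interval lengths
    between stressed syllables. stress_pattern: sequence of 0/1 per syllable,
    1 = stressed. Lower entropy suggests a more regular rhythm."""
    positions = [i for i, s in enumerate(stress_pattern) if s == 1]
    # number of unstressed syllables between consecutive stresses
    intervals = [b - a - 1 for a, b in zip(positions, positions[1:])]
    if not intervals:
        return 0.0
    freqs = Counter(intervals)
    total = len(intervals)
    return -sum((n / total) * math.log2(n / total) for n in freqs.values())

# e.g. a strict iamb (0 1 0 1 0 1 0 1) has a single interval length and entropy 0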

Session 3C Literature II

  • Gender and Power in Japanese Light Novels

    Xiaoyun Gong, Yuxi Lin, Ye Ding and Lauren Klein

    In Japanese culture, the light novel, a combination of text and anime-style illustrations, is a relatively new literary form. It derives from the broader otaku culture, which is also associated with video games, manga, cosplay, anime, and other forms of Japanese popular culture. Though the light novel lacks the global reach of some of these other genres, such as manga and anime, it nonetheless attracts millions of readers across a range of gender and age groups. While distinct subgenres of the light novel have emerged, such as romance, adventure, horror, and harem, issues of gender stereotyping, power imbalances and other forms of inequality remain strongly entrenched. These issues can be attributed to how otaku culture is rooted in heterosexual male desire. This paper offers a quantitative assessment of these issues of gender inequality. We analyze 290 light novels, scraped from the Baka-Tsuki Translation Community Wiki, in terms of the power relationships between female and male characters as they evolve over the course of each novel. We find patterns consistent with issues of gender stereotyping and power differentials. More specifically, we find that female characters consistently wield less power than male characters, especially toward the end of each novel. We find some variation in specific subgenres. We conclude with close readings of two light novels, demonstrating how a power-frames approach to analyzing gender stereotypes in otaku culture augments existing work on the subject.
  • Reviewer Preferences and Gender Disparities in Aesthetic Judgments

    Ida Marie S. Lassen, Yuri Bizzoni, Telma Peura, Mads Rosendahl Thomsen and Kristoffer Nielbo

    Aesthetic preferences are considered highly subjective, resulting in inherently noisy judgments of aesthetic objects, yet certain aspects of aesthetic judgment display convergent trends over time. This paper presents a study that uses literary reviews as a proxy for aesthetic judgment in order to identify systematic components that can be attributed to bias. Specifically, we find that judgments of literary quality differ across media types and display a gender bias. In newspapers, male reviewers have a same-gender preference while female reviewers show an opposite-gender preference. In the blogosphere, on the other hand, female reviewers prefer female authors. While alternative accounts of this apparent gender disparity exist, we argue that it reflects a cultural gender antagonism that needs to be taken into account when doing computational assessment of aesthetics.
  • A Quantitative Study of Fictional Things

    Andrew Piper and Sunyam Bagga

    In this paper, we apply machine learning-based predictive models to two large data sets of historical and contemporary fiction to better understand the role that things play in fictional writing. A large body of scholarship known as "thing theory" has attempted to understand the function of fictional things within literature, mostly by focusing on small case studies. We provide the first-ever estimates of the distribution of different types of things in English-language fiction over the past two centuries, along with experiments to model their semantic identity. Our findings suggest that the most common fictional things are structural in nature, functioning akin to narrative props. We conclude by showing how these findings pose problems for inherited theories of fictional things and propose an alternative theoretical framework, embodied cognition, as a way of understanding the predominance of structural things.
  • Determining Author or Reader: A Statistical Analysis of Textual Features in Children's and Adult Literature

    Lindsey Geybels

    Because literary texts are composed of words rather than numbers, they are not an obvious choice to serve as data for statistical analyses. However, with the help of computational techniques, words can be converted to numerical data and certain parts of a text can be examined on a large scale. Textual elements such as sentence length, word length and lexical diversity, which scholars associate on the one hand with the writing style of an individual author and on the other with the complexity of a text and the intended age of its readers, can thus be subjected to statistical evaluation. In this paper, data from a little under 700 English and Dutch books written for different ages are analysed using a statistical linear mixed model. The results show that the textual elements studied are better suited to detecting the age of the intended reader of a text than the identity or age of the author. (An illustrative sketch of such a model follows this session's listing.)
  • Modeling Plots of Narrative Texts as Temporal Graphs

    Leonard Konle and Fotis Jannidis

    The paper outlines a formal model of plot (and syuzhet) for narrative texts. The basic units are scenes and the motif repertoire instantiated in each scene. The motif repertoire consists of three sets of (closely related) elements: character stereotypes, types of verbal actions and action types. It is assumed that the motif repertoire is highly dependent on the corpus which is analyzed, in our case a corpus of romance and horror novels published as pulp fiction. The resulting information is represented in a temporal graph, which in turn is used to compute relevant information on the scenes and characters. Scenes are also characterized by their valence and arousal values. A second representation, which uses a topic model of the direct speech and the narrative text as a simple proxy for the types of verbal actions and the action types, is also created. To assess the ability of these information structures to indicate changes in the temporal structures, three evaluation methods based on artificial data are used. We can confirm that a very abstract representation of the plot is able to do so but, contrary to our expectations, the more information-rich model which makes use of the topic model does not do better. The main contribution of this paper is its attempt to integrate different research proposals into one integral model. We offer a descriptive framework and a proposal for a formal model of plot, which makes it possible to identify research problems and align existing approaches.
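
A hedged sketch of the kind of linear mixed model described in Geybels's paper above: a textual feature (here, a hypothetical mean sentence length) as the response, the intended reader age as a fixed effect, and the author as a random grouping factor. The file name, column names and model formula are illustrative assumptions.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame: one row per book with pre-computed textual features.
books = pd.read_csv("book_features.csv")  # assumed columns: mean_sentence_length,
                                          # intended_reader_age, author

model = smf.mixedlm(
    "mean_sentence_length ~ intended_reader_age",  # fixed effect: reader age
    data=books,
    groups=books["author"],                        # random intercept per author
)
result = model.fit()
print(result.summary())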

Session 3D Text Classification

  • Peeking Inside the DH Toolbox – Detection and Classification of Software Tools in DH Publications

    Nicolas Ruth, Andreas Niekler and Manuel Burghardt

    Digital tools have played an important role in Digital Humanities (DH) since its beginnings. Accordingly, a lot of research has been dedicated to the documentation of tools as well as to the analysis of their impact from an epistemological perspective. In this paper we propose a binary and a multi-class classification approach to detect and classify tools. The approach builds on state-of-the-art neural language models. We test our model on two different corpora and report the results for different parameter configurations in two consecutive experiments. In the end, we demonstrate how the models can be used for actual tool detection and tool classification tasks in a large corpus of DH journals.
  • Boosting Word Frequencies in Authorship Attribution

    Maciej Eder

    In this paper, I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks. Rather than computing relative frequencies as the number of occurrences of a given word divided by the total number of tokens in a text, I argue that a more efficient normalization factor is the total number of relevant tokens only. The notion of relevant words includes synonyms and, usually, a few dozen other words that are in some way semantically similar to the word in question. To determine such a semantic background, a word embedding model can be used. The proposed method substantially outperforms classical most-frequent-word approaches, usually by a few percentage points depending on the input settings. (An illustrative sketch of this normalization follows this session's listing.)
  • What Do We Talk About When We Talk About Topic?

    Joris J. van Zundert, Marijn Koolen, Julia Neugarten, Peter Boot, Willem van Hage and Ole Mussmann

    We apply Top2Vec to a corpus of 10,921 novels in the Dutch language. For the purposes of our research we want to understand whether our topic model may serve as a proxy for genre. We find that topics are extremely narrowly related to an existing genre classification historically created by publishers. Interestingly, we also find that, notwithstanding careful vocabulary filtering as suggested by prior research, various other signals, such as author signal, stubbornly remain.
  • Good Omens: A Collaborative Authorship Study

    Leonardo Grotti, Mona Allaert and Patrick Quick

    Good Omens is a collaborative novel written by Terry Pratchett and Neil Gaiman. Rising interest in the book, amplified by the success of the recent screen adaptation, has aroused curiosity regarding its realization. We use Rolling Delta and Rolling Classify to detect stylistic signals from each author as these methods reveal authorial takeovers. The same techniques are applied to compare the screenplay of the show to the novel. The results indicate that Good Omens resembles Pratchett’s work more closely. The screenplay is correctly attributed to Gaiman, its sole author, and the comparison reveals that Gaiman may have relied less on the source material over the course of the narrative arc.
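
A minimal sketch of the frequency normalisation Eder describes above: the relative frequency of a word is computed against the summed counts of its 'semantic background' (the word plus a few dozen embedding neighbours) rather than against all tokens in the text. The neighbourhood size of 50, the gensim-style embedding interface and the helper name are illustrative assumptions.

from collections import Counter

def boosted_relative_frequency(target, tokens, wv, background_size=50):
    """Relative frequency of `target`, normalised by the summed counts of its
    semantic background rather than by the total token count.

    tokens: list of tokens in the text; wv: a gensim-style KeyedVectors object
    exposing most_similar() (any pre-trained word embedding model is assumed)."""
    counts = Counter(tokens)
    neighbours = [w for w, _ in wv.most_similar(target, topn=background_size)]
    background = [target] + neighbours
    denominator = sum(counts.get(w, 0) for w in background)
    return counts[target] / denominator if denominator else 0.0

Such boosted frequencies could then replace ordinary relative frequencies in a most-frequent-word feature table for Delta-style attribution.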