A (Short) Study in Sherlock

A little website, for the purpose of a concise exercise in the digital humanities.


Author's note: This website runs best in Chrome on a desktop or laptop PC. It does function in Chrome on mobile, but this has not been extensively tested. Scrollbars are hidden, but every page is scrollable. I have made every effort to ensure the site runs but, if it fails to, a version in a Word document is provided.


The Sherlock Holmes mythos has always been popular. From the first depiction of the detective on screen in 1900 to the BBC's Sherlock, the series has remained prominent in cultural production. For this analysis, however, I would like to return to where things started: the original Sherlock Holmes canon, written by Arthur Conan Doyle, consisting of four novels and five short-story collections. I use computational reading to test a hypothesis about this canon: that Doyle settled quickly on a narrative formula and style of writing that remained consistent throughout the entire series. To do this, I will begin with one of the simplest measures to calculate computationally: the texts' type-token ratios.

Type-Token Ratio

In layperson's terms, a text's type-token ratio (TTR) is the number of unique words (types) divided by the total number of words (tokens). In practice, it acts as a rough measure of the level of lexical variation in a text. Here, it has been calculated as a value between zero and one, with values closer to one indicating greater lexical variation and values closer to zero indicating less. With that in mind, this is the rough progression of lexical variation across the Sherlock Holmes series.

Figure 1. A line graph depicting the type-token ratio of texts in the Sherlock Holmes canon from 1887 to 1927.

This visualisation alone presents a very mixed picture from which little can be inferred. However, a different picture emerges when we separate the novels and the short story collections.

Figure 2. A line graph depicting the type-token ratio of the Sherlock Holmes novels from 1887 to 1915.
Figure 3. A line graph depicting the type-token ratio of the Sherlock Holmes short story collections from 1892 to 1927.

These visualisations are more useful. Beginning with the novels, we see a downward trend from a higher TTR to a lower TTR, suggesting that lexical variation decreased as the series progressed. This frames the early novels as more lexically varied and, perhaps, experimental; as the series developed, Doyle may have formed a stronger idea of its semantic field and, thus, used less varied language. While the TTR cannot reveal whether the language used in the later novels was consistent, its decrease suggests, at the very least, that there was less experimentation as the series went on.

However, the TTR of the short story collections increases over time, suggesting that Doyle's lexical variation increased across them. This is the inverse of the trend in the novels: here, Doyle seems to experiment more with the semantic field of the series as time goes on. When we look at both graphs together, though, we see the TTR values (on average) converge around 1915, when the last Holmes novel was published. Thus, while the short story collections may seem to become more experimental as a whole, they converge with the novels in terms of lexical variation as Doyle continues to write for the series, suggesting that he settled on a level of lexical variation for both novels and short stories as he gained experience with them. Therefore, our hypothesis seems to be relatively congruent with the TTR values for the texts.
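As a rough illustration of the measure used in this section, the following is a minimal sketch of a type-token ratio calculation. It is not the code behind the figures: the function name and the tokenisation rule (lowercasing the text and keeping runs of letters and apostrophes) are assumptions made for the example, and a different tokeniser would give slightly different values.

```python
import re


def type_token_ratio(text):
    """Return unique words divided by total words, or 0.0 for empty text.

    Assumption: a "word" is any run of letters or apostrophes after
    lowercasing. This is an illustrative rule, not the project's own.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    types = set(tokens)  # each distinct word counted once
    return len(types) / len(tokens)


sample = "The game is afoot and the game is on"
print(round(type_token_ratio(sample), 3))
```

Raw TTR tends to fall as a text gets longer, since common words repeat more often, so values are most comparable between texts of similar length.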

Sentiment Analysis

Another set of data points that seems to agree with our hypothesis is the development of sentiment (in the sense of positive versus negative language) across the texts or, more specifically, the novels. Due to the sheer number of them, I have been unable to run sentiment analysis on the individual short stories. I elected not to use data for the short story collections as, unlike the novels, they are not intended to form one coherent narrative.

From the novels alone, however, we get the sense that Doyle developed, to some degree, a formula of sentiment.

Figure 4. A line graph depicting sentiment across A Study in Scarlet (1887).
Figure 5. A line graph depicting sentiment across The Sign of the Four (1890).
Figure 6. A line graph depicting sentiment across The Hound of the Baskervilles (1902).
Figure 7. A line graph depicting sentiment across The Valley of Fear (1915).

In these graphs, where a sentiment value (on the Y-axis) of plus one is the most positive and minus one is the most negative, we see an overall trend from positive to negative language in all of the texts except The Sign of the Four. This trend recreates, to some degree, what we saw in the TTR values. That the trend varies between the first two novels suggests that Doyle was still experimenting early in the series. In the late series, however, Doyle sticks to a general trend of sentiment in language, even if the specific progression varies between novels. Thus, we may conclude that Doyle developed a formula of moving from positive to negative language (at least, from a crowdsourced contemporary perspective of the English language) after writing the first two novels. This seems to support the hypothesis further, as it suggests that Doyle did develop a formula for sentiment; it simply was not developed immediately upon the series' conception.
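To make the idea of a sentiment trend across a novel more concrete, here is a minimal sketch under stated assumptions. It splits a text into equal segments, scores each segment on a scale from minus one to plus one, and fits a straight line through the scores; a negative slope corresponds to a move from positive to negative language. The four-word lexicon, the function names and the segmenting rule are all hypothetical and chosen for illustration only; the figures above were produced with a full crowdsourced lexicon rather than this toy one.

```python
import re

# Hypothetical toy lexicon: word -> polarity in [-1, 1].
# A real analysis would substitute a full sentiment lexicon.
TOY_LEXICON = {"happy": 1.0, "pleasant": 0.5, "grim": -0.5, "murder": -1.0}


def segment_sentiment(text, n_segments, lexicon=TOY_LEXICON):
    """Average the polarity of lexicon words in each of n_segments chunks."""
    tokens = re.findall(r"[a-z']+", text.lower())
    size = max(1, len(tokens) // n_segments)
    scores = []
    for i in range(n_segments):
        start = i * size
        # The last segment absorbs any leftover tokens.
        end = len(tokens) if i == n_segments - 1 else start + size
        hits = [lexicon[t] for t in tokens[start:end] if t in lexicon]
        # Segments with no lexicon words are treated as neutral.
        scores.append(sum(hits) / len(hits) if hits else 0.0)
    return scores


def trend_slope(scores):
    """Least-squares slope of the scores against their segment index."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


scores = segment_sentiment("happy pleasant day grim murder night", 2)
print(scores, trend_slope(scores))
```

A single slope hides the shape of the progression within each novel, so it is best read alongside the line graphs rather than in place of them.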

Natural Language Processing

To return to viewing both the novels and short stories at once, we will move on to natural language processing, used here to define parts of speech. While there is insufficient time to examine closely how every part of speech was used across the series, one striking visualisation is worth considering.

Figure 8. A grouped bar chart displaying the distribution of the word types across the texts in the Sherlock Holmes series.

In my computational processing, I only took adjectives, nouns, pronouns and verbs into account, meaning the percentage on the Y-axis here refers to the proportion of the collected data that each part of speech occupies. The proportions are therefore relative to one another rather than to the whole text. Even so, the similarity of the distributions across the texts strongly suggests that Doyle used these four parts of speech in similar relative quantities throughout the series. This visualisation implies that, from the series' start to its end, Doyle focused on the more objective language of nouns, pronouns and verbs rather than the more subjective language of adjectives, and consistently used these in the same relative proportions. We may infer, then, that Doyle likely maintained a rigid standard for using certain parts of speech throughout the whole series.
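The proportions described above can be sketched as follows. This example assumes the text has already been tagged by some part-of-speech tagger and arrives as (word, tag) pairs; the tag names used here (ADJ, NOUN, PRON, VERB) follow the Universal POS convention, which is an assumption and may differ from the tag set the project actually used. The function name and the hand-tagged sample are hypothetical.

```python
from collections import Counter

# Assumed tag names for the four parts of speech considered in the analysis.
KEPT_TAGS = ("ADJ", "NOUN", "PRON", "VERB")


def pos_proportions(tagged_tokens, kept=KEPT_TAGS):
    """Return each kept tag's share (as a percentage) of all kept tokens.

    Tokens with any other tag are ignored, so the percentages are
    relative to the four kept categories only and sum to 100.
    """
    counts = Counter(tag for _, tag in tagged_tokens if tag in kept)
    total = sum(counts.values())
    if total == 0:
        return {tag: 0.0 for tag in kept}
    return {tag: 100 * counts[tag] / total for tag in kept}


# Hand-tagged sample for illustration only.
sample = [
    ("Holmes", "PROPN"), ("he", "PRON"), ("lit", "VERB"), ("old", "ADJ"),
    ("pipe", "NOUN"), ("the", "DET"), ("smoke", "NOUN"),
]
print(pos_proportions(sample))
```

Because everything outside the four categories is discarded before dividing, a change in how often Doyle used, say, adverbs or prepositions would not show up in these percentages.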

Conclusion

While it would overstate the statistical significance of the data gathered here to say that it establishes anything conclusive about the Sherlock Holmes series, it provides some insight into how the series changed or stayed the same as it progressed. It suggests that the relative proportion of nouns, pronouns and verbs to adjectives remained consistent across the series' duration. Some elements of the series did change according to the measures presented: lexical variation decreased in the novels, the lexical variation of novels and short stories became similar in the late series, and the novels came to move consistently from positive to negative language. These progressions can be seen as reflections of Doyle moving from experimenting in the early series to gaining a stronger idea of what the typical Holmes story looks like, in terms of lexical variation and trend of sentiment, in the later texts. Therefore, at least partially, the data here suggests that the hypothesis brought to the text may be correct. The Sherlock Holmes series seems to maintain specific standards regarding parts of speech across its duration. Where it varies in other elements, experimentation occurs at the beginning of the series and settles down as it progresses. This may suggest that the Sherlock Holmes series was neither completely varied nor completely stagnant: some features, it seems, Doyle envisaged at the start and stuck to; others, however, were subject to experimentation until he had settled on a general idea of how a Holmes novel or short story should be.

Notes.

  • The raw text for my corpus comes from the ever-useful Project Gutenberg.
  • Not all the data I produced is included in this analysis. TTR data for individual short stories was not used because I could not make sense of the volume of data in the space of this analysis. NLP data for individual parts of speech was substantial enough for an analysis of its own and, thus, was not used here. This data, alongside all of the project's data, is available upon request.
  • Data obtained and produced has been cleaned somewhat, but not entirely.
  • This project is largely experimental and, thus, not perfectly coded or cleaned. Nonetheless, I hope my vision for it is apparent, even if I have not fully achieved it.
