SherlockHolmes Part II

Barry Zeeberg [aut, cre]

2023-03-28

SherlockHolmes: An R Program to Analyze the Hidden Structure of Sherlock Holmes Stories by Statistical Pattern Analysis of Concordances

 

 

Barry Zeeberg


Background

This is the second manuscript documenting the Sherlock R language program. The first manuscript provided the motivation, methods, and preliminary results. Here I expand the types of studies and results.

The functions of the Sherlock package are designed to let the user interact with the data at several levels: a high-level view gives a broad overview, with the option to drill down to specific detail, such as specific correlated lead terms within the concordance. These views are all generated by default, and are accessible to the user through a structured hierarchical archive (Table 1).

I had previously presented a scatter plot of fraction values as a function of the chronological order for the search pattern “Holmes,” across all 60 Sherlock Holmes stories. I will now expand that result with additional search patterns, namely “Watson” and “Sherlock.” Holmes was usually addressed as “Holmes” rather than by the more intimate “Sherlock,” except when addressed by his brother Mycroft, or when Watson might introduce him to someone as “Mr. Sherlock Holmes.” So it is not surprising that the fraction value (please see the previous manuscript for this and other definitions) is generally lower for the search pattern “Sherlock” than for “Holmes.”


Table 1. Directories, Files, Functions, and Figures


Overview: Inventory

The histogram of fraction values for the search pattern “Watson” shows the broad range of values that characterize different stories (Figure 1).

Figure 1. The histogram of fraction values for the search pattern “Watson”


The fraction value for “Holmes” tends to increase with chronological order (Figure 2). Somewhat surprisingly, the opposite is true for the more intimate “Sherlock” (Figure 3). Conan Doyle appears to have drifted away from the more intimate presentation.

Figure 2. Scatter plot of fraction values as a function of the chronological order for the search pattern “Holmes,” across all 60 Sherlock Holmes stories.

Figure 3. Scatter plot of fraction values as a function of the chronological order for the search pattern “Sherlock,” across all 60 Sherlock Holmes stories.


These can be compared more directly in an overlay plot (Figure 4).

Figure 4. Scatter plot overlay of fraction values as a function of the chronological order for the search patterns “Holmes” and “Sherlock” across all 60 Sherlock Holmes stories.


In spite of its lower fraction value compared with “Holmes,” “Sherlock” does appear at least once in all but 3 of the 60 stories (Figure 5). These absences occur among the chronologically latest stories. This is consistent with my earlier hypothesis that Conan Doyle appears to have drifted away from the more intimate presentation.

Figure 5. At least 1 instance of the search string in a text. Stories are in chronological order from top to bottom.


Another search string that suggests itself is “Watson,” whose chronology closely parallels that for “Holmes” (Figure 6).

Figure 6. Scatter plot overlay of fraction values as a function of the chronological order for the search patterns “Holmes” and “Watson” across all 60 Sherlock Holmes stories.


“Watson” appears in all but 1 story (data not shown). There are several other search strings that have a significant presence throughout the stories, but nowhere near our top 3 (Figure 7).


Figure 7. At least 1 instance of the search string in a text. Stories are in chronological order from top to bottom.


Unique among these is “The Adventure of the Empty House,” which is the only story in which all 5 of the search patterns appear. The stars must have been in alignment for that one. In second place is “The Valley of Fear,” in which 3 of the search patterns appear.

The scatter plot for the search patterns “Holmes,” “Sherlock,” and “Mycroft” across all 60 Sherlock Holmes stories shows unusually high fraction values for both “Mycroft” and “Sherlock” in one text (Figure 8).


Figure 8. Scatter plot overlay of fraction values as a function of the chronological order for the search patterns “Holmes,” “Sherlock,” and “Mycroft” across all 60 Sherlock Holmes stories.


The text is identified as “The Greek Interpreter” from the tabulation in the archival files (Figures 9, 10).


Figure 9. A portion of the archival file for “Mycroft.”


Figure 10. A portion of the archival file for “Sherlock.”


The hypothesis is that Sherlock’s brother Mycroft would use the more intimate and familiar form of address.

To examine this in more detail, we can review the corresponding cumulative distribution overlay plot (Figure 11).

Figure 11. Cumulative distribution for search patterns “Sherlock,” “Holmes,” and “Mycroft” within “The Greek Interpreter.”


The 2 intervals in which “Sherlock” is highly expressed run from 0 to 1900 and from 4200 to 5500 cumulative total words. These are the same intervals in which “Mycroft” is highly expressed. This tends to substantiate the hypothesis that Sherlock’s brother Mycroft would use the more intimate and familiar form of address.
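
A cumulative distribution of this kind can be sketched in a few lines of R. This is only a minimal illustration, not the code of the Sherlock package: the helper name cumulative_hits, the whitespace-based word count, and the toy lines are all assumptions made for the example, and the exact definition of the fraction value remains the one given in the previous manuscript.

```r
# Hypothetical helper, not a function of the package described here.
# For each line, count the words and the occurrences of the search pattern,
# then accumulate both so that hits can be plotted against total words.
cumulative_hits <- function(lines, pattern) {
  # Words per line: split on whitespace and drop empty tokens.
  tokens <- strsplit(trimws(lines), "\\s+")
  words_per_line <- vapply(tokens, function(tok) sum(nzchar(tok)), integer(1))
  # Occurrences per line (fixed string, case-sensitive); gregexpr gives -1 for no match.
  matches <- gregexpr(pattern, lines, fixed = TRUE)
  hits_per_line <- vapply(matches, function(m) sum(m > 0), integer(1))
  data.frame(cum_words = cumsum(words_per_line),
             cum_hits  = cumsum(hits_per_line))
}

# Invented toy lines, used only to exercise the helper.
toy <- c("Sherlock and Mycroft sat down",
         "Holmes said nothing",
         "My dear Sherlock said Mycroft",
         "Watson waited")
cd <- cumulative_hits(toy, "Sherlock")
print(cd)
# plot(cd$cum_words, cd$cum_hits, type = "s") would draw the step curve.
```

Steep stretches of such a curve correspond to intervals in which the search pattern is highly expressed, and flat stretches to intervals in which it is absent.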

Rolling Average

In the first manuscript, the main analytical method was the cumulative distribution analysis. This was a natural choice, as I had in hand the line-by-line fraction values for a given search pattern throughout each text. Furthermore, this analysis is exact, as it involves simply counting instances as they are encountered. However, it has the drawback that the interpretation is not as intuitively straightforward as that of an alternate approximate method, i.e., the rolling average, which is presented in this section.

The rolling average is a method (to interpret the distribution of fraction values) that is complementary to the cumulative distribution. Mathematically, the rolling average approximates the derivative of the cumulative distribution (it is the slope of the cumulative distribution measured across a sliding window), and lends itself to a more intuitive interpretation. For example, Figures 11 and 12 both show the influence of the presence of “Mycroft” and the increased presence of “Sherlock,” at the beginning and at the end of the text.


Figure 12. Rolling average for search patterns “Sherlock,” “Holmes,” and “Mycroft” within “The Greek Interpreter.”
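
The relationship between the two methods can be made concrete with a short R sketch. It is an illustration under stated assumptions rather than the package's own implementation: the function name rolling_average, the trailing window, and the toy per-line counts are hypothetical. The point is that the mean over a window of width w equals the change in the cumulative sum across that window, divided by w.

```r
# Hypothetical helper, not a function of the package described here.
# Trailing rolling mean written as a finite difference of the cumulative sum.
rolling_average <- function(x, window) {
  stopifnot(window >= 1, window <= length(x))
  n <- length(x)
  cum <- c(0, cumsum(x))  # cum[i + 1] holds the sum of x[1:i]
  # Mean of x[(i - window + 1):i] for i = window, ..., n.
  (cum[(window + 1):(n + 1)] - cum[1:(n - window + 1)]) / window
}

# Invented per-line hit counts, for illustration only.
hits_per_line <- c(1, 0, 1, 0, 0, 0, 2, 1)
ra <- rolling_average(hits_per_line, window = 4)
print(ra)  # 0.50 0.25 0.25 0.50 0.75
```

A wider window gives a smoother but less localized curve, which is why the rolling average is approximate whereas the cumulative distribution is exact.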

Literary Concordance Analysis

I am not versed in literary analysis, but I happen to know of one technique in which a key word or phrase occurs throughout a work, and the context surrounding each instance is retrieved and analyzed. In an adventure story like Sherlock Holmes, we would not expect the author to use symbolism, semiotics, etc. in the same way as a serious author, e.g., André Gide or Alfred Jarry. The concordance analysis would be meaningful for Gide or Jarry, but not so much for Conan Doyle. So my use of Sherlock Holmes to demonstrate concordance analysis is to be taken as a technical example, not as a realistic literary exercise.

Although the analysis requires an expert linguist, the preliminary retrieval can be automated. For instance, we can find each occurrence of “Holmes” throughout the text(s), and retrieve a certain number of lines before and after (Figure 13).


Figure 13. Example of concordance for “Holmes” in “His Last Bow.”

This procedure essentially separates the entire text into 2 partitions, namely those lines that are retrieved (“inside”) and the remainder of the text (“outside”). For each word that occurs “inside” we can tabulate the number of occurrences, and the same for “outside.” After normalization, we can then identify words that are differentially expressed “inside.”
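
The partition can be sketched in R as follows. This is a minimal illustration, assuming a simple tokenization (lowercase, split on anything that is not a letter or apostrophe) and a symmetric context of a fixed number of lines; the function name concordance_partition, its context parameter, and the toy lines are hypothetical and are not taken from the package.

```r
# Hypothetical helper, not a function of the package described here.
# Split a text into lines within `context` lines of a hit ("inside") and the
# remaining lines ("outside"), then tabulate word counts in each partition.
concordance_partition <- function(lines, pattern, context = 1) {
  hit_lines <- grep(pattern, lines, fixed = TRUE)
  # Line indices around each hit, clipped to the bounds of the text.
  idx <- unlist(lapply(hit_lines, function(i) (i - context):(i + context)))
  idx <- unique(idx[idx >= 1 & idx <= length(lines)])
  inside <- seq_along(lines) %in% idx
  count_words <- function(x) {
    tok <- as.character(unlist(strsplit(tolower(x), "[^a-z']+")))
    table(tok[nzchar(tok)])
  }
  list(inside = count_words(lines[inside]),
       outside = count_words(lines[!inside]))
}

# Invented toy lines, used only to exercise the helper.
toy_text <- c("the rain fell on the moor",
              "said Holmes with a smile",
              "he lit his pipe",
              "the fog was thick",
              "a cab rattled past",
              "the street was empty")
part <- concordance_partition(toy_text, "Holmes", context = 1)

# Normalize raw counts to relative frequencies within each partition.
freq_inside  <- part$inside  / sum(part$inside)
freq_outside <- part$outside / sum(part$outside)
print(round(freq_inside, 3))
```

Dividing by the partition totals is one simple form of normalization; it lets a word's share of the “inside” lines be compared with its share of the “outside” lines even though the two partitions differ in size.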

This can be thought of as a “fishing expedition” or “looking for a needle in a haystack.” These data would need to be subjected to a complete and valid statistical analysis. This analysis could facilitate, but never replace, the job of the expert literary researcher. The standard type of statistic implemented here is the Chi Square p-value. This involves the construction of a simple 2 x 2 contingency table. For a given discovered concordance lead term, the cells are the raw counts for (1) number of occurrences of the lead term “inside,” (2) number of occurrences of all other words “inside,” (3) number of occurrences of the lead term “outside,” and (4) number of occurrences of all other words “outside.”
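
The 2 x 2 table described above can be passed directly to the base R function chisq.test. The sketch below is illustrative only: the wrapper name lead_term_test and the counts are invented, and I am assuming the default settings of chisq.test (which applies the Yates continuity correction to 2 x 2 tables), not asserting that the package uses them.

```r
# Hypothetical wrapper, not a function of the package described here.
# Build the 2 x 2 table for one lead term and run a Chi Square test on it.
lead_term_test <- function(term_inside, total_inside,
                           term_outside, total_outside) {
  tab <- matrix(c(term_inside,  total_inside  - term_inside,
                  term_outside, total_outside - term_outside),
                nrow = 2, byrow = TRUE,
                dimnames = list(partition = c("inside", "outside"),
                                word = c("lead term", "all other words")))
  chisq.test(tab)  # default: continuity correction for 2 x 2 tables
}

# Invented counts: 40 of 1000 words "inside" and 20 of 4000 words "outside".
res <- lead_term_test(40, 1000, 20, 4000)
print(res$observed)
print(res$p.value)
```

When a lead term is rare, the expected counts can be too small for the Chi Square approximation, in which case fisher.test on the same table is a common alternative. Because every word found “inside” is tested, the resulting p-values would also need a multiple-comparison adjustment (for example with p.adjust) before being interpreted.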

An example tabulation of typical results is presented in Figure 14.

Figure 14. Statistical analysis of the concordance data.

In an analysis of this type, which is not hypothesis-driven, the primary concern is to discern true positive results among a potential background of false positives. This is a valuable skill, and depends on a proper interplay between the statistical analysis and a human analysis by a domain expert.

It is “reassuring” that the top hit is the search pattern itself, “Holmes,” with a very low Chi Square p-value. The next nontrivial hit is “Sherlock,” presumably because the phrase “Sherlock Holmes” appeared. Another hit, “name,” is likely to be a false positive, although the context in which it appears would need to be reviewed before making a final decision.

GitHub repository of several completed studies

These studies can be viewed and downloaded from branches of the repository [https://github.com/barryzee/Concordance].