TidyTuesday 2025-11-18 - Sherlock Holmes

data science
tidytuesday
rstats
Published

November 22, 2025

The Complete Works of Sherlock Holmes

TidyTuesday is a weekly data visualization challenge. The details for this week can be found here.

Introduction

Analysing literature with data science and numerical methods is a fun journey. A few years ago I read Nabokov’s Favorite Word Is Mauve by Ben Blatt. His methods for quantifying literature are both interesting and compelling, and I want to see whether similar patterns show up here. Sidenote: that book was described to me as “the most NPR book ever,” and I love that.

My first step is to explore samples of the data to see if it’s usable as is.

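library(tidyverse)  # assumed setup (not shown in the post): dplyr, tidyr, stringr, ggplot2
library(gt)         # assumed setup: table rendering
# `data` is the TidyTuesday data frame; publish_order is the order in which each book first appears in it
books <- data |>
  distinct(book) |>
  mutate(publish_order = row_number())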



text_sample <- data[2000:2100,]

text_sample |>
  gt() |>
  tab_header(title = "Sample of Text") |>
  opt_stylize(style = 6, color = "gray") |>
  tab_options(container.height = "300px", container.overflow.y = "scroll")
Sample of Text
book text line_num
A Study In Scarlet "'You had best tell me all about it now,' I said. 'Half-confidences 2000
A Study In Scarlet are worse than none. Besides, you do not know how much we know of 2001
A Study In Scarlet it.' 2002
A Study In Scarlet NA 2003
A Study In Scarlet "'On your head be it, Alice!' cried her mother; and then, turning to 2004
A Study In Scarlet me, 'I will tell you all, sir. Do not imagine that my agitation on 2005
A Study In Scarlet behalf of my son arises from any fear lest he should have had a hand 2006
A Study In Scarlet in this terrible affair. He is utterly innocent of it. My dread is, 2007
A Study In Scarlet however, that in your eyes and in the eyes of others he may appear to 2008
A Study In Scarlet be compromised. That however is surely impossible. His high 2009
A Study In Scarlet character, his profession, his antecedents would all forbid it.' 2010
A Study In Scarlet NA 2011
A Study In Scarlet "'Your best way is to make a clean breast of the facts,' I answered. 2012
A Study In Scarlet 'Depend upon it, if your son is innocent he will be none the worse.' 2013
A Study In Scarlet NA 2014
A Study In Scarlet "'Perhaps, Alice, you had better leave us together,' she said, and 2015
A Study In Scarlet her daughter withdrew. 'Now, sir,' she continued, 'I had no intention 2016
A Study In Scarlet of telling you all this, but since my poor daughter has disclosed it 2017
A Study In Scarlet I have no alternative. Having once decided to speak, I will tell you 2018
A Study In Scarlet all without omitting any particular.' 2019
A Study In Scarlet NA 2020
A Study In Scarlet "'It is your wisest course,' said I. 2021
A Study In Scarlet NA 2022
A Study In Scarlet "'Mr. Drebber has been with us nearly three weeks. He and his 2023
A Study In Scarlet secretary, Mr. Stangerson, had been travelling on the Continent. I 2024
A Study In Scarlet noticed a "Copenhagen" label upon each of their trunks, showing that 2025
A Study In Scarlet that had been their last stopping place. Stangerson was a quiet 2026
A Study In Scarlet reserved man, but his employer, I am sorry to say, was far otherwise. 2027
A Study In Scarlet He was coarse in his habits and brutish in his ways. The very night 2028
A Study In Scarlet of his arrival he became very much the worse for drink, and, indeed, 2029
A Study In Scarlet after twelve o'clock in the day he could hardly ever be said to be 2030
A Study In Scarlet sober. His manners towards the maid-servants were disgustingly free 2031
A Study In Scarlet and familiar. Worst of all, he speedily assumed the same attitude 2032
A Study In Scarlet towards my daughter, Alice, and spoke to her more than once in a way 2033
A Study In Scarlet which, fortunately, she is too innocent to understand. On one 2034
A Study In Scarlet occasion he actually seized her in his arms and embraced her--an 2035
A Study In Scarlet outrage which caused his own secretary to reproach him for his 2036
A Study In Scarlet unmanly conduct.' 2037
A Study In Scarlet NA 2038
A Study In Scarlet "'But why did you stand all this,' I asked. 'I suppose that you can 2039
A Study In Scarlet get rid of your boarders when you wish.' 2040
A Study In Scarlet NA 2041
A Study In Scarlet "Mrs. Charpentier blushed at my pertinent question. 'Would to God 2042
A Study In Scarlet that I had given him notice on the very day that he came,' she said. 2043
A Study In Scarlet 'But it was a sore temptation. They were paying a pound a day 2044
A Study In Scarlet each--fourteen pounds a week, and this is the slack season. I am a 2045
A Study In Scarlet widow, and my boy in the Navy has cost me much. I grudged to lose the 2046
A Study In Scarlet money. I acted for the best. This last was too much, however, and I 2047
A Study In Scarlet gave him notice to leave on account of it. That was the reason of his 2048
A Study In Scarlet going.' 2049
A Study In Scarlet NA 2050
A Study In Scarlet "'Well?' 2051
A Study In Scarlet NA 2052
A Study In Scarlet "'My heart grew light when I saw him drive away. My son is on leave 2053
A Study In Scarlet just now, but I did not tell him anything of all this, for his temper 2054
A Study In Scarlet is violent, and he is passionately fond of his sister. When I closed 2055
A Study In Scarlet the door behind them a load seemed to be lifted from my mind. Alas, 2056
A Study In Scarlet in less than an hour there was a ring at the bell, and I learned that 2057
A Study In Scarlet Mr. Drebber had returned. He was much excited, and evidently the 2058
A Study In Scarlet worse for drink. He forced his way into the room, where I was sitting 2059
A Study In Scarlet with my daughter, and made some incoherent remark about having missed 2060
A Study In Scarlet his train. He then turned to Alice, and before my very face, proposed 2061
A Study In Scarlet to her that she should fly with him. "You are of age," he said, "and 2062
A Study In Scarlet there is no law to stop you. I have money enough and to spare. Never 2063
A Study In Scarlet mind the old girl here, but come along with me now straight away. You 2064
A Study In Scarlet shall live like a princess." Poor Alice was so frightened that she 2065
A Study In Scarlet shrunk away from him, but he caught her by the wrist and endeavoured 2066
A Study In Scarlet to draw her towards the door. I screamed, and at that moment my son 2067
A Study In Scarlet Arthur came into the room. What happened then I do not know. I heard 2068
A Study In Scarlet oaths and the confused sounds of a scuffle. I was too terrified to 2069
A Study In Scarlet raise my head. When I did look up I saw Arthur standing in the 2070
A Study In Scarlet doorway laughing, with a stick in his hand. "I don't think that fine 2071
A Study In Scarlet fellow will trouble us again," he said. "I will just go after him and 2072
A Study In Scarlet see what he does with himself." With those words he took his hat and 2073
A Study In Scarlet started off down the street. The next morning we heard of Mr. 2074
A Study In Scarlet Drebber's mysterious death.' 2075
A Study In Scarlet NA 2076
A Study In Scarlet "This statement came from Mrs. Charpentier's lips with many gasps and 2077
A Study In Scarlet pauses. At times she spoke so low that I could hardly catch the 2078
A Study In Scarlet words. I made shorthand notes of all that she said, however, so that 2079
A Study In Scarlet there should be no possibility of a mistake." 2080
A Study In Scarlet NA 2081
A Study In Scarlet "It's quite exciting," said Sherlock Holmes, with a yawn. "What 2082
A Study In Scarlet happened next?" 2083
A Study In Scarlet NA 2084
A Study In Scarlet "When Mrs. Charpentier paused," the detective continued, "I saw that 2085
A Study In Scarlet the whole case hung upon one point. Fixing her with my eye in a way 2086
A Study In Scarlet which I always found effective with women, I asked her at what hour 2087
A Study In Scarlet her son returned. 2088
A Study In Scarlet NA 2089
A Study In Scarlet "'I do not know,' she answered. 2090
A Study In Scarlet NA 2091
A Study In Scarlet "'Not know?' 2092
A Study In Scarlet NA 2093
A Study In Scarlet "'No; he has a latch-key, and he let himself in.' 2094
A Study In Scarlet NA 2095
A Study In Scarlet "'After you went to bed?' 2096
A Study In Scarlet NA 2097
A Study In Scarlet "'Yes.' 2098
A Study In Scarlet NA 2099
A Study In Scarlet "'When did you go to bed?' 2100



The most notable thing in the sample is that each observation is a literal line of the printed text. The lines can be concatenated so that each observation is a paragraph instead. This will be particularly useful for understanding dialog, because it is typical in literature to start a new paragraph each time the speaker changes. The NA rows between lines give us a clean boundary between paragraphs.

Additionally, I cross-referenced the novel order with the Canon of Sherlock Holmes in order to understand the order in which these works were published. I think it could be interesting to see if there are changes that occur between earlier works and later works.

paragraphs <- data |>
  mutate(paragraph = cumsum(is.na(text))) |>
  filter(!is.na(text)) |>
  group_by(book,paragraph) |>
  summarize(text = paste(text, collapse = " "))
`summarise()` has grouped output by 'book'. You can override using the
`.groups` argument.
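The grouping trick above is easier to see on a tiny example. This is a minimal sketch on made-up data: the `toy` tibble and its contents are hypothetical and only mimic the book / text / line_num structure of the real data. Each NA row bumps the running counter, so every line after a blank line gets a new paragraph id.

```r
library(dplyr)

# Hypothetical toy data: one row per printed line, NA marks a blank line.
toy <- tibble::tibble(
  book = "Toy Book",
  text = c("First line of paragraph one,", "second line.", NA,
           "Paragraph two.", NA,
           "Paragraph three, line one", "and line two."),
  line_num = 1:7
)

toy_paragraphs <- toy |>
  mutate(paragraph = cumsum(is.na(text))) |>  # ids: 0 0 1 1 2 2 2
  filter(!is.na(text)) |>                     # drop the blank-line rows
  group_by(book, paragraph) |>
  summarize(text = paste(text, collapse = " "), .groups = "drop")

toy_paragraphs
```

Passing `.groups = "drop"` returns an ungrouped result and silences the message shown above. The post's own code keeps the default, which leaves the output grouped by book. Because the counter runs over the whole data frame, paragraph ids are global rather than restarting for each book.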
paragraphs_sample <- paragraphs[2000:2010,]

paragraphs_sample |>
  gt() |>
  tab_header(title = "Sample of Paragraphs") |>
  opt_stylize(style = 6, color = "gray") |>
  tab_options(container.height = "300px", container.overflow.y = "scroll")
Sample of Paragraphs
paragraph text
The Adventure of Charles Augustus Milverton
7455 "What I say is true," Holmes answered. "The money cannot be found. Surely it is better for you to take the substantial sum which I offer than to ruin this woman's career, which can profit you in no way?"
7456 "There you make a mistake, Mr. Holmes. An exposure would profit me indirectly to a considerable extent. I have eight or ten similar cases maturing. If it was circulated among them that I had made a severe example of the Lady Eva I should find all of them much more open to reason. You see my point?"
7457 Holmes sprang from his chair.
7458 "Get behind him, Watson! Don't let him out! Now, sir, let us see the contents of that note-book."
7459 Milverton had glided as quick as a rat to the side of the room, and stood with his back against the wall.
7460 "Mr. Holmes, Mr. Holmes," he said, turning the front of his coat and exhibiting the butt of a large revolver, which projected from the inside pocket. "I have been expecting you to do something original. This has been done so often, and what good has ever come from it? I assure you that I am armed to the teeth, and I am perfectly prepared to use my weapons, knowing that the law will support me. Besides, your supposition that I would bring the letters here in a note-book is entirely mistaken. I would do nothing so foolish. And now, gentlemen, I have one or two little interviews this evening, and it is a long drive to Hampstead." He stepped forward, took up his coat, laid his hand on his revolver, and turned to the door. I picked up a chair, but Holmes shook his head and I laid it down again. With bow, a smile, and a twinkle Milverton was out of the room, and a few moments after we heard the slam of the carriage door and the rattle of the wheels as he drove away.
7461 Holmes sat motionless by the fire, his hands buried deep in his trouser pockets, his chin sunk upon his breast, his eyes fixed upon the glowing embers. For half an hour he was silent and still. Then, with the gesture of a man who has taken his decision, he sprang to his feet and passed into his bedroom. A little later a rakish young workman with a goatee beard and a swagger lit his clay pipe at the lamp before descending into the street. "I'll be back some time, Watson," said he, and vanished into the night. I understood that he had opened his campaign against Charles Augustus Milverton; but I little dreamed the strange shape which that campaign was destined to take.
7462 For some days Holmes came and went at all hours in this attire, but beyond a remark that his time was spent at Hampstead, and that it was not wasted, I knew nothing of what he was doing. At last, however, on a wild, tempestuous evening, when the wind screamed and rattled against the windows, he returned from his last expedition, and having removed his disguise he sat before the fire and laughed heartily in his silent inward fashion.
7463 "You would not call me a marrying man, Watson?"
7464 "No, indeed!"
7465 "You'll be interested to hear that I am engaged."



Exclamation points

The analysis from Blatt that I’m interested in replicating is his exclamation point analysis.

F. Scott Fitzgerald and Ernest Hemingway were known for their very reserved use of exclamation points. Fitzgerald described using one as “laughing at your own joke.” We can use this data to determine how well Arthur Conan Doyle follows this rule.

exclamation_points <- paragraphs |>
  mutate(exclamation_points = str_count(text, "!"))

exclamation_point_sample <- exclamation_points |>
  filter(exclamation_points > 0) |>
  ungroup() |>
  slice_sample(n = 5)

### showing some example lines as a sanity check to make sure our code is working

exclamation_point_sample |>
  gt() |>
  tab_header(title = "Sample of Lines with Exclamation Points") |>
  opt_stylize(style = 6, color = "gray") |>
  tab_options(container.height = "300px", container.overflow.y = "scroll")
Sample of Lines with Exclamation Points
book paragraph text exclamation_points
The Adventure of the Noble Bachelor 3585 "Without, however, the knowledge of pre-existing cases which serves me so well. There was a parallel instance in Aberdeen some years back, and something on very much the same lines at Munich the year after the Franco-Prussian War. It is one of these cases--but, hullo, here is Lestrade! Good-afternoon, Lestrade! You will find an extra tumbler upon the sideboard, and there are cigars in the box." 2
The Adventure of the Devil's Foot 13067 "It won't do, Watson!" said he with a laugh. "Let us walk along the cliffs together and search for flint arrows. We are more likely to find them than clues to this problem. To let the brain work without sufficient material is like racing an engine. It racks itself to pieces. The sea air, sunshine, and patience, Watson--all else will come. 1
The Adventure of the Six Napoleons 7640 "Splendid!" 1
The Man with the Twisted Lip 2582 "Why," said my wife, pulling up her veil, "it is Kate Whitney. How you startled me, Kate! I had not an idea who you were when you came in." 1
The Sign of the Four 1026 "It means murder," said he, stooping over the dead man. "Ah, I expected it. Look here!" He pointed to what looked like a long, dark thorn stuck in the skin just above the ear. 1
exclamation_points_summary <- exclamation_points |>
  group_by(book) |>
  summarize(exclamation_points = sum(exclamation_points)) |>
  left_join(books, by = "book") |>
  arrange(publish_order)

### actually counting exclamation points

exclamation_points_summary |>
  gt() |>
  tab_header(title = "Exclamation Points per Book") |>
  tab_options(container.height = "500px", container.overflow.y = "scroll") |>
  cols_label(
    book = "Book",
    exclamation_points = "Exclamation Points",
    publish_order = "Publish Order"
  ) |>
  fmt_number(columns = exclamation_points, decimals = 0)
Exclamation Points per Book
Book Exclamation Points Publish Order
A Study In Scarlet 85 1
The Sign of the Four 127 2
A Scandal in Bohemia 33 3
The Red-Headed League 16 4
A Case of Identity 13 5
The Boscombe Valley Mystery 27 6
The Five Orange Pips 17 7
The Man with the Twisted Lip 32 8
The Adventure of the Blue Carbuncle 40 9
The Adventure of the Speckled Band 34 10
The Adventure of the Engineer's Thumb 31 11
The Adventure of the Noble Bachelor 19 12
The Adventure of the Beryl Coronet 43 13
The Adventure of the Copper Beeches 40 14
Silver Blaze 36 15
The Yellow Face 12 16
The Stock-Broker's Clerk 29 17
The "Gloria Scott" 19 18
The Musgrave Ritual 7 19
The Reigate Squires 24 20
The Crooked Man 15 21
The Resident Patient 17 22
The Greek Interpreter 13 23
The Naval Treaty 34 24
The Final Problem 10 25
The Adventure of the Empty House 16 26
The Adventure of the Norwood Builder 24 27
The Adventure of the Dancing Men 17 28
The Adventure of the Solitary Cyclist 48 29
The Adventure of the Priory School 36 30
The Adventure of Black Peter 10 31
The Adventure of Charles Augustus Milverton 24 32
The Adventure of the Six Napoleons 17 33
The Adventure of the Three Students 23 34
The Adventure of the Golden Pince-Nez 36 35
The Adventure of the Missing Three-Quarter 34 36
The Adventure of the Abbey Grange 21 37
The Adventure of the Second Stain 49 38
The Hound of the Baskervilles 182 39
The Valley Of Fear 319 40
The Adventure of Wisteria Lodge 12 41
The Adventure of the Cardboard Box 20 42
The Adventure of the Red Circle 46 43
The Adventure of the Bruce-Partington Plans 43 44
The Adventure of the Dying Detective 45 45
The Disappearance of Lady Frances Carfax 40 46
The Adventure of the Devil's Foot 18 47
His Last Bow 23 48



Notably, the four highest counts belong to the full novels, while the rest are short stories. It makes sense that longer works contain more exclamation points, but they would skew the rest of the analysis, so I exclude them below.

In graphical form:

novels <- c("A Study In Scarlet",
            "The Sign of the Four",
            "The Hound of the Baskervilles",
            "The Valley Of Fear" )

exclamation_points_summary |>
  filter(!(book %in% novels)) |>
  ggplot(aes(x = publish_order, y = exclamation_points)) +
  geom_point() +
  labs(
    x = "Publish Order",
    y = "Exclamation Points",
    title = "Exclamation Points per Book"
  )



Graphically, it does not look like Doyle’s use of exclamation points changed over time. We can check this with a simple linear regression of the count on publish order.

exclamation_points_summary |>
  filter(!(book %in% novels)) |>
  lm(exclamation_points ~ publish_order, data = _) |>
  summary()

Call:
lm(formula = exclamation_points ~ publish_order, data = filter(exclamation_points_summary, 
    !(book %in% novels)))

Residuals:
    Min      1Q  Median      3Q     Max 
-18.637 -10.027  -3.060   9.545  21.007 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    23.0608     3.8057   6.060 3.26e-07 ***
publish_order   0.1356     0.1351   1.003    0.321    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.86 on 42 degrees of freedom
Multiple R-squared:  0.02341,   Adjusted R-squared:  0.0001549 
F-statistic: 1.007 on 1 and 42 DF,  p-value: 0.3214
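A confidence interval for the slope makes the "no trend" reading more concrete. This is a minimal sketch, assuming the counts printed in the "Exclamation Points per Book" table are the ones that went into the model. It re-enters the 44 short-story counts by hand, refits the same model with base R, and calls `confint()`. The names `short_stories` and `fit` are mine, not the post's.

```r
# Short-story counts transcribed from the "Exclamation Points per Book" table.
# The four novels (publish_order 1, 2, 39, 40) are omitted, as in the post.
short_stories <- data.frame(
  publish_order = c(3:38, 41:48),
  exclamation_points = c(
    33, 16, 13, 27, 17, 32, 40, 34, 31, 19, 43, 40,      # orders 3-14
    36, 12, 29, 19, 7, 24, 15, 17, 13, 34, 10,           # orders 15-25
    16, 24, 17, 48, 36, 10, 24, 17, 23, 36, 34, 21, 49,  # orders 26-38
    12, 20, 46, 43, 45, 40, 18, 23                       # orders 41-48
  )
)

fit <- lm(exclamation_points ~ publish_order, data = short_stories)

coef(fit)                                    # intercept and slope
confint(fit, "publish_order", level = 0.95)  # 95% interval for the slope
```

With a p-value of 0.32 for the slope, the 95% interval necessarily includes zero, which is another way of saying the data are compatible with no change over time.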



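The slope of about 0.14 exclamation points per story is not statistically distinguishable from zero (p = 0.32), and publish order explains only about 2% of the variance. Sometimes an analysis doesn’t turn up anything particularly exciting; seeing that he remained consistent throughout his career is an awesome insight in itself.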

Edit 2025-11-24

I received an excellent suggestion around normalization:

Excellent!! It would be cool to see exclamation marks divided by word count! Since longer works will have more by nature! How many exclamation marks can I get in this post?! 😂

— Libby Heeren (@libbyheeren.bsky.social) November 24, 2025 at 10:02 AM


Indeed, just because I only included the short stories in the previous visualisation does not mean that all of Doyle’s short stories are the same length, so comparing absolute counts might not be fair. That said, there is one more method of normalization worth considering. Hemingway would argue that most exclamation points should be replaced with periods. So, in addition to the above suggestion, I will also normalize by dividing by the sum of terminal periods and exclamation points. This one is a bit more tricky. I don’t want to count “Dr.” or “1.” as sentences. To correct for this, I will only count periods that come immediately after a word containing at least one lowercase vowel.
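To see what this heuristic does and does not catch, here is a minimal sketch that runs the pattern on a few hand-picked sentences. The first and third come from the samples above. The second and fourth are made up to probe edge cases. The names `pattern`, `examples` and `counts` are mine.

```r
library(stringr)

# A period directly after a word that contains at least one lowercase vowel.
pattern <- "\\b\\w*[aeiou]\\w*\\."

examples <- c(
  "Holmes sprang from his chair.",         # ordinary sentence ending
  "Dr. Watson and Mr. Holmes sat down.",   # "Dr." and "Mr." are skipped
  "'It is your wisest course,' said I.",   # "I." has no lowercase vowel: missed
  "Prof. Moriarty smiled."                 # "Prof." contains a vowel: counted
)

counts <- str_count(examples, pattern)
counts  # 1 1 0 2
```

So the rule is an approximation rather than a true sentence counter. It skips vowel-free abbreviations as intended, but it also misses sentences that end in a word like "I" and counts abbreviations that happen to contain a lowercase vowel.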

exclamation_points_normalization <- paragraphs |>
  mutate(exclamation_points = str_count(text, "!"),
         words = lengths(str_split(text, " ")),
         terminal_periods = str_count(text, "\\b\\w*[aeiou]\\w*\\.") #thanks chatGPT
         )
sample <- exclamation_points_normalization |>
  ungroup() |>
  filter(exclamation_points > 0) |>
  filter(terminal_periods > 0) |>
  slice_sample(n=5)

sample |> #Trust but verify
  gt() |>
  tab_header(title = "Sample of Punctuation and Word Counts") |>
  opt_stylize(style = 6, color = "gray") |>
  tab_options(container.height = "300px", container.overflow.y = "scroll") 
Sample of Punctuation and Word Counts
book paragraph text exclamation_points words terminal_periods
The Adventure of the Solitary Cyclist 6820 "Enough of this," said my friend, coldly. "Drop that pistol! Watson, pick it up! Hold it to his head! Thank you. You, Carruthers, give me that revolver. We'll have no more violence. Come, hand it over!" 4 36 4
The Adventure of the Golden Pince-Nez 8009 "Tobacco and my work, but now only tobacco," the old man exclaimed. "Alas! what a fatal interruption! Who could have foreseen such a terrible catastrophe? So estimable a young man! I assure you that after a few months' training he was an admirable assistant. What do you think of the matter, Mr. Holmes?" 3 53 2
The Hound of the Baskervilles 8905 "Excellent! This is a colleague, Watson, after our own heart. But the marks?" 1 13 1
A Study In Scarlet 50 "Indeed!" I murmured. 1 3 1
The Hound of the Baskervilles 10056 "To see Sir Henry. Ah, here he is!" 1 8 1
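As a worked example of the two normalizations, take the first sampled row above (The Adventure of the Solitary Cyclist, paragraph 6820). The values are copied from that row. The variable names `per_word` and `per_sentence_ending` are shorthand of mine for the two ratios.

```r
# Values copied from the sampled row for paragraph 6820.
exclamation_points <- 4
words <- 36
terminal_periods <- 4

# Exclamation points per word.
per_word <- exclamation_points / words

# Share of sentence endings (periods plus exclamation points) that are "!".
per_sentence_ending <- exclamation_points / (terminal_periods + exclamation_points)

round(per_word, 3)   # 0.111
per_sentence_ending  # 0.5
```

A single shouted paragraph like this one scores far higher than any whole story does, which is why the ratios are computed on per-book sums rather than averaged across paragraphs.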



exclamation_points_normalization_summary <- exclamation_points_normalization |>
  group_by(book) |>
  summarize(exclamation_points = sum(exclamation_points),
            terminal_periods = sum(terminal_periods),
            words = sum(words),
            exclamation_point_per_word = exclamation_points / words,
            exclamation_point_per_sentence_ending = exclamation_points / (terminal_periods + exclamation_points)) |>
  left_join(books, by = "book")

metrics <- tibble(
  Metric = c("Most Exclamations", "Least Exclamations", "Most Exclamations per Word", "Least Exclamations per Word", "Most Exclamations per Sentence Ending", "Least Exclamations per Sentence Ending"),
  book = c(
    exclamation_points_normalization_summary$book[which.max(exclamation_points_normalization_summary$exclamation_points)],
    exclamation_points_normalization_summary$book[which.min(exclamation_points_normalization_summary$exclamation_points)],
    exclamation_points_normalization_summary$book[which.max(exclamation_points_normalization_summary$exclamation_point_per_word)],
    exclamation_points_normalization_summary$book[which.min(exclamation_points_normalization_summary$exclamation_point_per_word)],
    exclamation_points_normalization_summary$book[which.max(exclamation_points_normalization_summary$exclamation_point_per_sentence_ending)],
    exclamation_points_normalization_summary$book[which.min(exclamation_points_normalization_summary$exclamation_point_per_sentence_ending)]
  ),
  Value = c(
    max(exclamation_points_normalization_summary$exclamation_points),
    min(exclamation_points_normalization_summary$exclamation_points),
    max(exclamation_points_normalization_summary$exclamation_point_per_word),
    min(exclamation_points_normalization_summary$exclamation_point_per_word),
    max(exclamation_points_normalization_summary$exclamation_point_per_sentence_ending),
    min(exclamation_points_normalization_summary$exclamation_point_per_sentence_ending)
  )
)

metrics |>
  gt() |>
  cols_label(
    book = "Book",
    Value = "Value"
  ) |>
  fmt_number(columns = Value, decimals = 4) |>
  tab_header(title = "Exclamation Point Usage Summary")
Exclamation Point Usage Summary
Metric Book Value
Most Exclamations The Valley Of Fear 319.0000
Least Exclamations The Musgrave Ritual 7.0000
Most Exclamations per Word The Adventure of the Dying Detective 0.0078
Least Exclamations per Word The Musgrave Ritual 0.0009
Most Exclamations per Sentence Ending The Adventure of the Solitary Cyclist 0.1004
Least Exclamations per Sentence Ending The Adventure of Wisteria Lodge 0.0172



exclamation_points_normalization_summary_for_plot <- exclamation_points_normalization_summary |>
  pivot_longer(cols = c(exclamation_point_per_word, exclamation_point_per_sentence_ending)) |>
  mutate(name = recode(name,
                         "exclamation_point_per_word" = "Per Word",
                         "exclamation_point_per_sentence_ending" = "Per Sentence Ending"))

exclamation_normalization_plot <- exclamation_points_normalization_summary_for_plot |>
  ggplot(aes(x = publish_order, y = value)) +
  facet_wrap(~name, scales = "free_y") +
  geom_point() +
  labs(x = "Publish Order",
       y = NULL,
       title = "Normalized Exclamation Point Usage")
exclamation_normalization_plot