ATTRIBUTING AUTHORSHIP WITH STYLOMETRY
Stylometry is the quantitative study of literary style through computational text analysis. It’s based on the idea that we all have a unique, consistent, and recognizable style to our writing. This includes our vocabulary, our use of punctuation, the average length of our sentences and words, and so on.
A common application of stylometry is authorship attribution. Do you ever wonder if Shakespeare really wrote all his plays? Or if John Lennon or Paul McCartney wrote the song “In My Life”? Could Robert Galbraith, author of The Cuckoo’s Calling, really be J. K. Rowling in disguise? Stylometry can help find the answer!
Stylometry has been used to overturn murder convictions and even helped identify the Unabomber, who was arrested in 1996 and later convicted. Other uses include detecting plagiarism and determining the emotional tone behind words, such as in social media posts. Stylometry can even be used to detect signs of depression and suicidal tendencies.
In this chapter, you’ll use multiple stylometric techniques to determine whether Sir Arthur Conan Doyle or H. G. Wells wrote the novel The Lost World.
Project #2: The Hound, The War, and The Lost World
Sir Arthur Conan Doyle (1859–1930) is best known for the Sherlock Holmes stories, considered milestones in the field of crime fiction. H. G. Wells (1866–1946) is famous for several groundbreaking science fiction novels, including The War of the Worlds, The Time Machine, The Invisible Man, and The Island of Dr. Moreau.
In 1912, the Strand Magazine published The Lost World, a serialized version of a science fiction novel. It told the story of an Amazon basin expedition, led by zoology professor George Edward Challenger, that encountered living dinosaurs and a vicious tribe of ape-like creatures.
Although the author of the novel is known, for this project, let’s pretend it’s in dispute and it’s your job to solve the mystery. Experts have narrowed the field down to two authors, Doyle and Wells. Wells is slightly favored because The Lost World is a work of science fiction, which is his purview. It also includes brutish troglodytes redolent of the Morlocks in his 1895 work The Time Machine. Doyle, on the other hand, is known for detective stories and historical fiction.
THE OBJECTIVE
Write a Python program that uses stylometry to determine whether Sir Arthur Conan Doyle or H. G. Wells wrote the novel The Lost World.
THE STRATEGY
The science of natural language processing (NLP) deals with the interactions between the precise and structured language of computers and the nuanced, frequently ambiguous “natural” language used by humans. Example uses for NLP include machine translations, spam detection, comprehension of search engine questions, and predictive text recognition for cell phone users.
The most common NLP tests for authorship analyze the following features of a text:
• Word length: A frequency distribution plot of the length of words in a document
• Stop words: A frequency distribution plot of stop words (short, noncontextual function words like the, but, and if)
• Parts of speech: A frequency distribution plot of words based on their syntactic functions (such as nouns, pronouns, verbs, adverbs, adjectives, and so on)
• Most common words: A comparison of the most commonly used words in a text
• Jaccard similarity: A statistic used for gauging the similarity and diversity of a sample set
If Doyle and Wells have distinctive writing styles, these five tests should be enough to distinguish between them. We’ll talk about each test in more detail in the coding section.
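Two of the tests listed above can be sketched with nothing but the Python standard library. The following is a minimal illustration, not the chapter's implementation: the helper names (tokenize, word_length_distribution, jaccard_similarity), the regular-expression tokenizer, and the two short sample sentences are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def word_length_distribution(words):
    """Map each word length to the fraction of words having that length."""
    counts = Counter(len(word) for word in words)
    total = sum(counts.values())
    return {length: count / total for length, count in sorted(counts.items())}


def jaccard_similarity(words_a, words_b):
    """Return |A intersect B| / |A union B| for the unique words in two samples."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0  # Two empty samples share nothing to compare.
    return len(set_a & set_b) / len(union)


# Made-up sentences standing in for much longer texts.
sample_a = tokenize("The hound ran over the moor.")
sample_b = tokenize("The Martians ran over the common.")

print(word_length_distribution(sample_a))
print(jaccard_similarity(sample_a, sample_b))
```

The Jaccard value ranges from 0 (no shared words) to 1 (identical vocabularies), so a higher score between an author's sample and the disputed text suggests a closer match in vocabulary.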
To capture and analyze each author’s style, you’ll need a representative corpus, or a body of text. For Doyle, use the famous Sherlock Holmes novel The Hound of the Baskervilles, published in 1902. For Wells, use The War of the Worlds, published in 1898. Both these novels contain more than 50,000 words, more than enough for a sound statistical sampling. You’ll then compare each author’s sample to The Lost World to determine how closely the writing styles match.
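The comparison step might be organized roughly as follows. This is a sketch under stated assumptions rather than the method used later in the chapter: the short STOP_WORDS tuple, the helper names (stop_word_profile, closest_author), and the distance measure (a sum of absolute differences in stop-word share) are all hypothetical choices, and the short strings are placeholders for the full novels.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption); a real analysis
# would use a much fuller one.
STOP_WORDS = ("the", "and", "of", "to", "a", "in", "but", "if")


def stop_word_profile(text):
    """Return each stop word's share of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # Avoid dividing by zero on empty input.
    return {word: counts[word] / total for word in STOP_WORDS}


def closest_author(known_texts, unknown_text):
    """Return the author whose stop-word profile is nearest the unknown text."""
    unknown = stop_word_profile(unknown_text)
    distances = {}
    for author, text in known_texts.items():
        profile = stop_word_profile(text)
        distances[author] = sum(abs(profile[w] - unknown[w]) for w in STOP_WORDS)
    return min(distances, key=distances.get), distances


# Placeholder strings, invented for this example, stand in for the novels.
known = {
    "doyle": "the hound of the moor and the man of the house",
    "wells": "if a cylinder fell to a common in a town but a",
}
unknown = "the edge of the plateau and the men of the camp"

author, distances = closest_author(known, unknown)
print(author, distances)
```

The placeholder strings were constructed so that one profile matches exactly; they say nothing about the real novels. With the actual corpora, the texts would be read from files and the distances would be much closer together, which is one reason to run several independent tests.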
To perform stylometry, you’ll use the Natural Language Toolkit (NLTK), a popular suite of programs and libraries for working with human language data in Python. It’s free and works on Windows, macOS, and Linux. Created in 2001 as part of a computational linguistics course at the University of Pennsylvania, NLTK has continued to develop and expand with the help of dozens of contributors.
Copyright © 2020 by Lee Vaughan. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.