Type-Token Ratio
The full code can be found here and here.
I have provided the core function (type_token()) below.
Modules used: glob, pathlib, pandas, string, xlsxwriter
# Creating a function to do the type-token analysis for each text
def type_token(filepath, title):
    with open(filepath, encoding='utf8') as f:  # Opens the .txt file as variable f
        full_passage = f.read().lower()  # Reads the .txt file into a string and then closes the file
    full_passage = full_passage.replace('\n', ' ')  # Replaces new-line symbols with spaces
    full_passage = full_passage.split()  # Splits the string into a list of words
    # clean_words() (defined elsewhere in the notebook) strips any remaining punctuation except hyphens;
    # the list must be rebuilt here, because reassigning the loop variable inside a for loop would not change it
    full_passage = [clean_words(word) for word in full_passage]
    tokens = len(full_passage)  # Counts the number of words in the list - this is the number of 'tokens' in the document
    passage = [word for word in full_passage if word not in stopwords]  # Makes a list of the passage's words with the stopwords removed (stopwords is defined elsewhere in the notebook)
    unique_terms = []  # Creates a list that will record each distinct word once
    for word in passage:  # For each word in the passage...
        if word not in unique_terms:  # If it hasn't already been recorded as a unique word...
            unique_terms.append(word)  # ...add it to the list of unique words
    types = len(unique_terms)  # Counts the number of unique words - this is the number of 'types'
    ttr = round(types / tokens, 5)  # Calculates the type-token ratio (types divided by tokens) and rounds it to five decimal places
    passage_date = title[0:4]  # The document's year (the first four characters of the file's name)
    passage_month = title[5:7]  # The document's month (the sixth and seventh characters of the file's name)
    passage_month_n = int(passage_month) - 1  # Converts the month to an integer (int() ignores a leading zero) and subtracts one so that January maps to zero
    # The following maths treats one month as 0.08333 (roughly 1/12) of a year, so, if the year is 2023, January is 2023.00000,
    # February is 2023.08333 and so on; the purpose of all this is to make visualisations in Excel easier
    # e.g. April ('04') becomes 3 * 0.08333 = 0.24999, giving the date string '2023.24999'
    passage_month_calc = round(passage_month_n * 0.08333, 5)  # Multiplies the month by 0.08333
    passage_month = str(passage_month_calc)[1:]  # Converts this float back to a string and removes the starting zero
    passage_date = passage_date + passage_month  # Adds this decimal onto the end of the year string
    document_type_token = {'text': title, 'date': passage_date, 'TTR': ttr, 'types': types, 'tokens': tokens}  # A dictionary containing the document's title, date, type-token ratio (TTR), number of types and number of tokens
    return document_type_token  # Returns this dictionary to be used by the wider loop
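type_token() relies on two names defined elsewhere in the full notebook: clean_words() and stopwords. Their exact definitions aren't shown here, so the following is only a minimal sketch of what they might look like, using the string module listed above; the placeholder stopword set and the file name in the usage line are illustrative.
import string

punctuation_to_strip = string.punctuation.replace('-', '')  # Every punctuation character except the hyphen
stopwords = {'the', 'a', 'an', 'and', 'of', 'to', 'in'}  # Placeholder set; the notebook's actual stopword list may differ

def clean_words(word):  # Hypothetical reconstruction of the helper the function calls
    return word.translate(str.maketrans('', '', punctuation_to_strip)).lower()  # Removes punctuation except hyphens and lowercases

# Given the slicing above, file names are assumed to begin 'YYYY-MM', e.g.:
result = type_token('texts/2023-04-speech.txt', '2023-04-speech')  # Hypothetical file; 'date' would be '2023.24999'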
Sentiment Analysis
The full code can be found here.
I have provided the core function (find_sentiment()) below.
Modules used: vaderSentiment, glob, pathlib, nltk, pandas, statistics, xlsxwriter
# Creating a function to do the sentiment analysis for each 5% of the text
def find_sentiment(filepath, title):
    with open(filepath, encoding='utf8') as f:  # Opens the .txt file as variable f
        text = f.read()  # Reads the .txt file into a string and then closes the file
    text = text.replace('\n', ' ')  # Replaces new-line symbols with spaces
    text_sentences = nltk.sent_tokenize(text)  # Splits the text into a list of sentences using an NLTK (Natural Language Toolkit) method
    no_sentences = len(text_sentences)  # Finds the number of sentences from the length of the list of sentences
    start_pos = 0  # The starting position for the sentence index, beginning at zero (counting starts at zero in Python)
    five_percent = round(no_sentences / 20)  # Calculates how many sentences are equivalent to five percent of the whole document
    section_scores = {'text': title}  # A dictionary that will hold the sentiment score for each five percent, with the document's name as its first key/value pair
    for i in range(1, 21):  # For each five percent...
        # The final slice runs to the end of the list so that sentences lost to rounding are not skipped
        end_pos = start_pos + five_percent if i < 20 else no_sentences
        total_scores = []  # An empty list that will store the scores for each sentence
        for sentence in text_sentences[start_pos:end_pos]:  # For every sentence in this five percent...
            scores = sentimentAnalyser.polarity_scores(sentence)  # ...work out its scores...
            total_scores.append(scores['compound'])  # ...and add its overall (compound) score to the list of scores
        if len(total_scores) != 0:  # If there are scores for this five percent...
            scores_count = mean(total_scores)  # ...work out the average of these scores and store it
        else:  # If there aren't...
            scores_count = 0  # ...set the overall score for this section to zero
        section_scores[i * 5] = float(scores_count)  # A new key/value pair: the key is which percent this is, the value is the average sentiment
        start_pos += five_percent  # Moves the starting position past this five percent, ready for the next five percent of the document
    return section_scores  # Returns the dictionary of the document's sentiment scores
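Before this function runs, the notebook needs to have set up the sentimentAnalyser and mean names it uses. A minimal setup sketch, assuming VADER's standard class and NLTK's punkt sentence tokenizer:
import nltk
from statistics import mean
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

nltk.download('punkt')  # One-off download of the tokenizer data used by nltk.sent_tokenize()
sentimentAnalyser = SentimentIntensityAnalyzer()  # VADER analyser; polarity_scores() returns 'neg', 'neu', 'pos' and 'compound' values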
Natural Language Processing
The full code can be found here.
I have provided the core function (top_speech_parts()) below.
Modules used: spacy, collections, pandas, glob, pathlib
# Creating a function for working out the most common adjectives, nouns, pronouns and verbs in each text
def top_speech_parts(filepath):
    with open(filepath, encoding='utf-8') as f:  # Opens the .txt file (document) as variable f
        document = nlp(f.read().lower())  # Reads the .txt file into the 'document' variable, applies NLP to it and then closes the file
    adjs = []  # Creates an empty list that will store the adjectives
    nouns = []  # Creates an empty list that will store the nouns
    pronouns = []  # Creates an empty list that will store the pronouns
    verbs = []  # Creates an empty list that will store the verbs
    for token in document:  # For every word in the document...
        if token.pos_ == 'ADJ':  # If it is an adjective...
            adjs.append(token.text)  # ...record it in the adjective list
        elif token.pos_ == 'NOUN':  # If it is a noun...
            nouns.append(token.text)  # ...record it in the noun list
        elif token.pos_ == 'PRON':  # If it is a pronoun...
            pronouns.append(token.text)  # ...record it in the pronoun list
        elif token.pos_ == 'VERB':  # If it is a verb...
            verbs.append(token.text)  # ...record it in the verb list
    tokens = make_tokens(document)  # Stores what the make_tokens function (defined later in the notebook) returns
    adjs_tally = Counter(adjs).most_common()  # Counts how many times each adjective is used and orders the (word, count) tuples from most to least common
    nouns_tally = Counter(nouns).most_common()  # Does the same for the nouns
    pronouns_tally = Counter(pronouns).most_common()  # ...for the pronouns
    verbs_tally = Counter(verbs).most_common()  # ...and for the verbs
    return adjs_tally, nouns_tally, pronouns_tally, verbs_tally, tokens  # Returns the four tallies and the list of the text's tokens with their parts of speech
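top_speech_parts() likewise depends on names created elsewhere in the notebook: the nlp pipeline and the make_tokens() helper. The sketch below assumes spaCy's small English model is the one loaded, and make_tokens() is my guess at a helper that pairs each token with its part-of-speech label; the notebook's actual versions may differ.
import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')  # Assumes the small English pipeline is installed (python -m spacy download en_core_web_sm)

def make_tokens(document):  # Hypothetical reconstruction of the helper
    return [(token.text, token.pos_) for token in document]  # Each token paired with its part-of-speech label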
Notes for all code
In all of the code here, data was put into data frames using pandas and then exported to .xlsx files to be viewed in Excel. I formatted the data in Excel for improved readability and to create visualisations. In the case of the "Natural Language Processing" code, there was a separate Excel file for each document; for ease of viewing and working with the data, I combined the data from each of these files into one document. Each Jupyter Notebook file has a collection of update logs at the very bottom.
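As a sketch of that export step (the variable names and output file name here are illustrative, not the notebooks' own), results collected as dictionaries can be turned into a DataFrame and written out via xlsxwriter:
import pandas as pd

results = [type_token(path, title) for path, title in documents]  # 'documents' is an illustrative list of (filepath, title) pairs
df = pd.DataFrame(results)  # One row per document: text, date, TTR, types, tokens
df.to_excel('type_token_results.xlsx', engine='xlsxwriter', index=False)  # Exports to .xlsx for formatting and visualisation in Excel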