Type-Token Ratio
The full code can be found here and here.
I have provided the core function (type_token()) below.
Modules used: glob, pathlib, pandas, string, xlsxwriter
# Creating a function to do the type-token analysis for each text
def type_token(filepath, title):
    with open(filepath, encoding='utf8') as f:  # Opens the .txt file as variable f
        full_passage = f.read().lower()  # Reads the .txt file into a string and then closes the file
    full_passage = full_passage.replace('\n', ' ')  # Replaces new-line symbols with spaces
    full_passage = full_passage.split()  # Splits the string into a list of words
    # clean_words() (defined elsewhere in the notebook) strips any remaining punctuation except hyphens;
    # the list must be rebuilt here, because reassigning the loop variable inside a for loop would not change it
    full_passage = [clean_words(word) for word in full_passage]
    tokens = len(full_passage)  # Counts the number of words in the list - this is the number of 'tokens' in the document
    passage = [word for word in full_passage if word not in stopwords]  # Makes a list of the passage's words with the stopwords removed (stopwords is defined elsewhere in the notebook)
    unique_terms = []  # Creates a list that will record each distinct word once
    for word in passage:  # For each word in the passage...
        if word not in unique_terms:  # If it hasn't already been recorded as a unique word...
            unique_terms.append(word)  # ...add it to the list of unique words
    types = len(unique_terms)  # Counts the number of unique words - this is the number of 'types'
    ttr = round(types / tokens, 5)  # Calculates the type-token ratio (types divided by tokens) and rounds it to five decimal places
    passage_date = title[0:4]  # The document's year (the first four characters of the file's name)
    passage_month = title[5:7]  # The document's month (the sixth and seventh characters of the file's name)
    passage_month_n = int(passage_month) - 1  # Converts the month to an integer (int() ignores a leading zero) and subtracts one so that January maps to zero
    # The following maths treats one month as 0.08333 (roughly 1/12) of a year, so, if the year is 2023, January is 2023.00000,
    # February is 2023.08333 and so on; the purpose of all this is to make visualisations in Excel easier
    # e.g. April ('04') becomes 3 * 0.08333 = 0.24999, giving the date string '2023.24999'
    passage_month_calc = round(passage_month_n * 0.08333, 5)  # Multiplies the month by 0.08333
    passage_month = str(passage_month_calc)[1:]  # Converts this float back to a string and removes the starting zero
    passage_date = passage_date + passage_month  # Adds this decimal onto the end of the year string
    document_type_token = {'text': title, 'date': passage_date, 'TTR': ttr, 'types': types, 'tokens': tokens}  # A dictionary containing the document's title, date, type-token ratio (TTR), number of types and number of tokens
    return document_type_token  # Returns this dictionary to be used by the wider loop
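type_token() relies on two names defined elsewhere in the full notebook: clean_words() and stopwords. Their exact definitions aren't shown here, so the following is only a minimal sketch of what they might look like, using the string module listed above; the placeholder stopword set and the file name in the usage line are illustrative.
import string

punctuation_to_strip = string.punctuation.replace('-', '')  # Every punctuation character except the hyphen
stopwords = {'the', 'a', 'an', 'and', 'of', 'to', 'in'}  # Placeholder set; the notebook's actual stopword list may differ

def clean_words(word):  # Hypothetical reconstruction of the helper the function calls
    return word.translate(str.maketrans('', '', punctuation_to_strip)).lower()  # Removes punctuation except hyphens and lowercases

# Given the slicing above, file names are assumed to begin 'YYYY-MM', e.g.:
result = type_token('texts/2023-04-speech.txt', '2023-04-speech')  # Hypothetical file; 'date' would be '2023.24999'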
Sentiment Analysis
The full code can be found here.
I have provided the core function (find_sentiment()) below.
Modules used: vaderSentiment, glob, pathlib, nltk, pandas, statistics, xlsxwriter
# Creating a function to do the sentiment analysis for each 5% of the text
def find_sentiment(filepath, title):
    with open(filepath, encoding='utf8') as f:  # Opens the .txt file as variable f
        text = f.read()  # Reads the .txt file into a string and then closes the file
    text = text.replace('\n', ' ')  # Replaces new-line symbols with spaces
    text_sentences = nltk.sent_tokenize(text)  # Splits the text into a list of sentences using an NLTK (Natural Language Toolkit) method
    no_sentences = len(text_sentences)  # Finds the number of sentences from the length of the list of sentences
    start_pos = 0  # The starting position for the sentence index, beginning at zero (counting starts at zero in Python)
    five_percent = round(no_sentences / 20)  # Calculates how many sentences are equivalent to five percent of the whole document
    section_scores = {'text': title}  # A dictionary that will hold the sentiment score for each five percent, with the document's name as its first key/value pair
    for i in range(1, 21):  # For each five percent...
        # The final slice runs to the end of the list so that sentences lost to rounding are not skipped
        end_pos = start_pos + five_percent if i < 20 else no_sentences
        total_scores = []  # An empty list that will store the scores for each sentence
        for sentence in text_sentences[start_pos:end_pos]:  # For every sentence in this five percent...
            scores = sentimentAnalyser.polarity_scores(sentence)  # ...work out its scores...
            total_scores.append(scores['compound'])  # ...and add its overall (compound) score to the list of scores
        if len(total_scores) != 0:  # If there are scores for this five percent...
            scores_count = mean(total_scores)  # ...work out the average of these scores and store it
        else:  # If there aren't...
            scores_count = 0  # ...set the overall score for this section to zero
        section_scores[i * 5] = float(scores_count)  # A new key/value pair: the key is which percent this is, the value is the average sentiment
        start_pos += five_percent  # Moves the starting position past this five percent, ready for the next five percent of the document
    return section_scores  # Returns the dictionary of the document's sentiment scores
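Before this function runs, the notebook needs to have set up the sentimentAnalyser and mean names it uses. A minimal setup sketch, assuming VADER's standard class and NLTK's punkt sentence tokenizer:
import nltk
from statistics import mean
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

nltk.download('punkt')  # One-off download of the tokenizer data used by nltk.sent_tokenize()
sentimentAnalyser = SentimentIntensityAnalyzer()  # VADER analyser; polarity_scores() returns 'neg', 'neu', 'pos' and 'compound' values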
Natural Language Processing
The full code can be found here.
I have provided the core function (top_speech_parts()) below.
Modules used: spacy, collections, pandas, glob, pathlib
# Creating a function for working out the most common adjectives, nouns, pronouns and verbs in each text
def top_speech_parts(filepath):
    with open(filepath, encoding='utf-8') as f:  # Opens the .txt file (document) as variable f
        document = nlp(f.read().lower())  # Reads the .txt file into the 'document' variable, applies NLP to it and then closes the file
    adjs = []  # Creates an empty list that will store the adjectives
    nouns = []  # Creates an empty list that will store the nouns
    pronouns = []  # Creates an empty list that will store the pronouns
    verbs = []  # Creates an empty list that will store the verbs
    for token in document:  # For every word in the document...
        if token.pos_ == 'ADJ':  # If it is an adjective...
            adjs.append(token.text)  # ...record it in the adjective list
        elif token.pos_ == 'NOUN':  # If it is a noun...
            nouns.append(token.text)  # ...record it in the noun list
        elif token.pos_ == 'PRON':  # If it is a pronoun...
            pronouns.append(token.text)  # ...record it in the pronoun list
        elif token.pos_ == 'VERB':  # If it is a verb...
            verbs.append(token.text)  # ...record it in the verb list
    tokens = make_tokens(document)  # Stores what the make_tokens function (defined later in the notebook) returns
    adjs_tally = Counter(adjs).most_common()  # Counts how many times each adjective is used and orders the (word, count) tuples from most to least common
    nouns_tally = Counter(nouns).most_common()  # Does the same for the nouns
    pronouns_tally = Counter(pronouns).most_common()  # ...for the pronouns
    verbs_tally = Counter(verbs).most_common()  # ...and for the verbs
    return adjs_tally, nouns_tally, pronouns_tally, verbs_tally, tokens  # Returns the four tallies and the list of the text's tokens with their parts of speech
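top_speech_parts() likewise depends on names created elsewhere in the notebook: the nlp pipeline and the make_tokens() helper. The sketch below assumes spaCy's small English model is the one loaded, and make_tokens() is my guess at a helper that pairs each token with its part-of-speech label; the notebook's actual versions may differ.
import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')  # Assumes the small English pipeline is installed (python -m spacy download en_core_web_sm)

def make_tokens(document):  # Hypothetical reconstruction of the helper
    return [(token.text, token.pos_) for token in document]  # Each token paired with its part-of-speech label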
Notes for all code
In all of the code here, data was put into data frames using pandas and then exported to .xlsx files to be viewed in Excel. I formatted the data in Excel for improved readability and to create visualisations. In the case of the "Natural Language Processing" code, there was a separate Excel file for each document; for ease of viewing and working with the data, I combined the data from each of these files into one document. Each Jupyter Notebook file has a collection of update logs at the very bottom.
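As a sketch of that export step (the variable names and output file name here are illustrative, not the notebooks' own), results collected as dictionaries can be turned into a DataFrame and written out via xlsxwriter:
import pandas as pd

results = [type_token(path, title) for path, title in documents]  # 'documents' is an illustrative list of (filepath, title) pairs
df = pd.DataFrame(results)  # One row per document: text, date, TTR, types, tokens
df.to_excel('type_token_results.xlsx', engine='xlsxwriter', index=False)  # Exports to .xlsx for formatting and visualisation in Excel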