Text Analysis

5 min read · Jun 30, 2023

Have you ever wondered how machines can understand our text?

Before machines can comprehend our text, we need to enable them to understand it in their own language.

Text analysis is a methodology used to explore the structure, language, and symbols within written or verbal content. It is employed to gain insights into how people communicate with one another and to understand the information conveyed.

Textual analysis involves examining qualitative data to discern patterns, themes, and characteristics of the text under scrutiny.
It is important to differentiate between text analysis and content analysis. Content analysis is a subset of text analysis that focuses specifically on the content itself. Text analysis has a broader scope: it also covers steps such as data pre-processing, where raw data is transformed and prepared for analysis, which content analysis does not involve to the same degree.

Overall, textual analysis serves as a valuable tool for researchers, linguists, and professionals in various fields to gain deeper insights into human communication patterns and the intricate nature of textual content.

Let’s delve deeper into the concept of text analysis, which encompasses a range of sub-concepts and methodologies.

Positive Score -

Here, a dictionary of positive words is used. Each word in the text column is checked against the dictionary, and the score, which starts at 0, increases by 1 for every positive word found.

Negative Score -

As with the positive score, a dictionary of negative words is used. Each word in the text column is checked against it, and the score increases by 1 for every negative word found, so the negative score is also a non-negative count.

def calculate_positive_score(text):
    # positive_words is assumed to be a collection loaded from the positive dictionary file
    words = text.split()
    positive_score = 0                # start at 0
    for word in words:
        if word in positive_words:    # word appears in the positive dictionary
            positive_score += 1       # add +1
    return positive_score


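def calculate_negative_score(text):
    # negative_words is assumed to be a collection loaded from the negative dictionary file
    words = text.split()
    negative_score = 0                 # start at 0
    for word in words:
        if word in negative_words:     # word appears in the negative dictionary
            negative_score += 1        # add +1 so the score stays non-negative, as the polarity formula expects
    return negative_score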
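
As a minimal sketch of how the two scores behave on real text, the snippet below counts dictionary matches after lowercasing each token and stripping surrounding punctuation, so that "Great!" still matches "great". The word sets are tiny illustrative stand-ins for the actual dictionary files, and the helper name `count_matches` is hypothetical.

```python
import string

# Illustrative stand-ins for the real dictionary files (assumption)
positive_words = {"good", "great", "happy"}
negative_words = {"bad", "poor", "sad"}

def count_matches(text, lexicon):
    """Count how many tokens of `text` appear in `lexicon`."""
    # Lowercase and strip surrounding punctuation so "Great!" matches "great"
    tokens = [w.strip(string.punctuation).lower() for w in text.split()]
    return sum(1 for t in tokens if t in lexicon)

sample = "The service was great, but the food was bad. Really bad!"
positive_score = count_matches(sample, positive_words)
negative_score = count_matches(sample, negative_words)
print(positive_score, negative_score)   # prints: 1 2
```

Without the normalization step, a plain `text.split()` would leave tokens such as "bad." and "bad!" unmatched, which is worth keeping in mind if the text has not been cleaned beforehand.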

Polarity Score -

The polarity score is calculated to assess whether the overall text leans towards being positive or negative. The formula used to compute the polarity score is as follows:
Polarity Score = (Positive Score - Negative Score) / ((Positive Score + Negative Score) + 0.000001)
With both scores taken as non-negative counts, the result ranges from -1 (entirely negative) to +1 (entirely positive), and the small constant in the denominator prevents division by zero. The approach employed here is lexicon-based, relying on the predefined positive and negative word dictionaries to evaluate the sentiment of the text.

def calculate_polarity_score(positive_score, negative_score):
    polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
    return polarity_score
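
A quick worked example may help make the formula concrete. The sketch below assumes both scores are non-negative counts; the function name `polarity` is hypothetical and simply restates the formula above.

```python
def polarity(positive_score, negative_score):
    # Both inputs are assumed to be non-negative counts;
    # the small constant avoids division by zero when no sentiment words are found
    return (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)

mostly_positive = polarity(3, 1)   # 3 positive words, 1 negative word
mostly_negative = polarity(1, 3)   # 1 positive word, 3 negative words
no_sentiment = polarity(0, 0)      # no dictionary words at all

print(round(mostly_positive, 3), round(mostly_negative, 3), no_sentiment)
```

Swapping the counts flips the sign of the score, and a text with no dictionary words scores 0 rather than raising an error.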

Subjectivity Score -

The subjectivity score is used to assess the level of subjectivity or opinion expressed in a text. It is the share of cleaned words that carry sentiment, calculated using the formula:
Subjectivity Score = (Positive Score + Negative Score) / ((Total Words after cleaning) + 0.000001)

def calculate_subjectivity_score(positive_score, negative_score, total_words):
    subjectivity_score = (positive_score + negative_score) / (total_words + 0.000001)
    return subjectivity_score
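
For illustration, suppose a text has 8 words left after cleaning, of which 1 is positive and 2 are negative. The sketch below (with the hypothetical name `subjectivity`) applies the formula to those assumed numbers.

```python
def subjectivity(positive_score, negative_score, total_words):
    # Share of cleaned words that appear in either sentiment dictionary;
    # the small constant avoids division by zero for empty texts
    return (positive_score + negative_score) / (total_words + 0.000001)

score = subjectivity(1, 2, 8)    # 3 sentiment words out of 8 cleaned words
empty = subjectivity(0, 0, 0)    # empty text

print(round(score, 3), empty)
```

A score close to 0 suggests mostly neutral wording, while a score close to 1 means nearly every remaining word is a sentiment word.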

Analysis of Readability -

Readability analysis helps determine how easy or difficult it is to understand a given text. It involves several metrics:

Average Sentence Length: Average Sentence Length = (Number of Words) / (Number of Sentences)
Percentage of Complex Words: Complex words are those with more than two syllables, which tend to be harder to read. To calculate the percentage of complex words, we divide the number of complex words by the total number of words in the text. Percentage of Complex Words = (Number of Complex Words) / (Number of Words). Note that this ratio is a fraction between 0 and 1; the conventional Gunning Fog formula expresses it on a 0 to 100 scale, so multiply by 100 if the score is to be read as a grade level.
Fog Index: The Fog Index is a readability formula that combines the average sentence length and the percentage of complex words. It is calculated using the formula:

Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex Words)
The resulting Fog Index score represents the number of years of education required to understand the text. For example, a Fog Index of 8 means the text is written at an eighth-grade reading level.
The Fog Index is often used to assess the readability of technical and academic writing, as well as other types of prose. It helps determine the level of education needed to comprehend the text effectively.

def calculate_fog_index(average_sentence_length, percentage_complex_words):
    fog_index = 0.4 * (average_sentence_length + percentage_complex_words)
    return fog_index
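
The two inputs to the Fog Index can be computed directly from raw text. The following is a rough sketch under stated assumptions: sentences are split on runs of ".", "!" or "?", and syllables are approximated by counting groups of consecutive vowels rather than looked up in a pronunciation dictionary. The names `count_vowel_groups` and `readability_metrics` are hypothetical.

```python
import re

def count_vowel_groups(word):
    # Rough syllable estimate (assumption): each run of consecutive vowels is one syllable
    return len(re.findall(r"[aeiouy]+", word.lower()))

def readability_metrics(text):
    # Treat any run of ., ! or ? as a sentence boundary (a simplification)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_vowel_groups(w) > 2]

    avg_sentence_length = len(words) / max(len(sentences), 1)
    fraction_complex = len(complex_words) / max(len(words), 1)
    # Formula as written in this article; multiply fraction_complex by 100
    # to get the conventional Gunning Fog grade-level scale
    fog_index = 0.4 * (avg_sentence_length + fraction_complex)
    return avg_sentence_length, fraction_complex, fog_index

asl, frac, fog = readability_metrics("The cat sat. The situation deteriorated rapidly.")
print(asl, round(frac, 3), round(fog, 3))
```

In this sample there are 7 words in 2 sentences, and the vowel-group heuristic flags "situation", "deteriorated" and "rapidly" as complex.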

Complex Word Count

The complex word count refers to the number of words in the text that have more than two syllables.

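import nltk
from nltk.corpus import cmudict

nltk.download('cmudict')   # CMU Pronouncing Dictionary, used for syllable lookup
nltk.download('punkt')     # tokenizer data for nltk.word_tokenize (newer NLTK versions may need 'punkt_tab')
syllable_dict = cmudict.dict()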

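def count_syllables(word):
    pronunciations = syllable_dict.get(word.lower())   # look up the lowercase form of the word
    if not pronunciations:
        return 0                                        # word not in the dictionary
    # Vowel phonemes in CMUdict end with a stress digit, so counting them
    # gives the syllable count of the first listed pronunciation
    return sum(1 for phoneme in pronunciations[0] if phoneme[-1].isdigit())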

def is_complex(word):
    syllable_count = count_syllables(word)
    return syllable_count > 2

def count_complex_words(text):
    words = nltk.word_tokenize(text)
    num_complex_words = sum(is_complex(word) for word in words)
    return num_complex_words

df['complex_word_count'] = df['Text'].apply(count_complex_words)
df

Cleaned Word Count

The cleaned word count involves removing stop words (commonly used words like “the,” “and,” “is”) and punctuation marks from the text. After these removals, the remaining words are counted.
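
A minimal sketch of this step is shown below. The stop-word set is a small illustrative list (an assumption); in practice a fuller list such as `nltk.corpus.stopwords.words("english")` or a custom file would likely be used. The function name `count_cleaned_words` is hypothetical.

```python
import string

# Small illustrative stop-word list (assumption), not a complete one
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "it", "was"}

def count_cleaned_words(text):
    # Remove punctuation characters, then lowercase and split on whitespace
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    words = no_punct.lower().split()
    # Count only the words that are not stop words
    return sum(1 for w in words if w not in STOP_WORDS)

cleaned_count = count_cleaned_words("The weather is nice, and the park is open!")
print(cleaned_count)   # prints: 4
```

Only "weather", "nice", "park" and "open" survive the cleaning here, and this count is the "Total Words after cleaning" value used in the subjectivity formula.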

Syllable Count Per Word

For each word in the text, the syllable count is determined by counting the number of vowels present. Certain exceptions, such as words ending in “es” or “ed,” are handled by not counting them as separate syllables.

# Named differently from the CMUdict-based count_syllables above so that
# this heuristic does not overwrite it when both run in the same session
def count_syllables_by_vowels(word):
    word = word.lower()
    num_vowels = len([char for char in word if char in 'aeiou'])
    if word.endswith('es') or word.endswith('ed'):
        num_vowels -= 1           # do not count the 'es' or 'ed' ending, if found
    return num_vowels

def count_syllables_per_word(text):
    words = nltk.word_tokenize(text)
    syllable_counts = [count_syllables_by_vowels(word) for word in words]
    return syllable_counts

df['syllable_count_per_word'] = df['Text'].apply(count_syllables_per_word)
df

Personal Pronouns

To calculate the personal pronouns mentioned in the text, regex (regular expressions) is used to find the counts of words such as “I,” “we,” “my,” “ours,” and “us.” Special care is taken to exclude instances where the country name “US” is mistakenly included in the count.

import re

def count_personal_pronouns(text):
    pattern = r"\b(I|we|my|ours|us)\b"   # match the pronouns as whole words
    matches = re.findall(pattern, text, flags=re.IGNORECASE)
    # A lookbehind cannot stop "US" from matching itself under IGNORECASE,
    # so drop the all-caps "US" (the country) after matching instead
    count = sum(1 for match in matches if match != "US")
    return count

df['personal_pronoun_count'] = df['Text'].apply(count_personal_pronouns)
df

Average Word Length

The average word length is calculated by summing the total number of characters in each word and dividing it by the total number of words in the text. This metric provides insights into the typical length of words used in the text.


def calculate_avg_word_length(text):
    words = text.split()
    total_characters = sum(len(word) for word in words)  # sum of the characters in all words
    num_words = len(words)                               # total number of words
    if num_words == 0:
        return 0.0                                       # avoid division by zero for empty text
    avg_word_length = total_characters / num_words
    return avg_word_length

df['avg_word_length'] = df['Text'].apply(calculate_avg_word_length)
df

By analyzing these metrics, we can gain a better understanding of various aspects of the text, such as the complexity of vocabulary, the number of personal pronouns used, and the average length of words. These insights can be useful for various purposes, such as language analysis, writing style evaluation, or assessing the readability of a piece of text.

More insightful blogs coming soon! Stay in the loop by following me for regular updates.

Written by Celestial
