
Python jieba word segmentation example: word segmentation of “Journey to the West” (Part 2)


As we saw in the previous tutorial, some of the high-frequency phrases extracted from Journey to the West are not helpful for understanding the novel. In this section, we will show how to remove such unwanted words.

Remove unwanted words

In fact, character names are the most useful words for analyzing the plot and identifying the protagonists of the novel, so we need to further refine the program to filter out the unnecessary words.

The specific code is shown below, and the newly added code is still highlighted.

import jieba

def takeSecond(elem):
    return elem[1]

def main():
    path = "Journey to the West.txt"
    file = open(path, "r", encoding="utf-8")
    text = file.read()
    file.close()
    words = jieba.lcut(text)
    counts = {}
    for word in words:
        if len(word) == 1:        # skip single-character tokens
            continue
        else:
            counts[word] = counts.get(word, 0) + 1
    # Read the comma-separated list of words to exclude
    file = open("excludes.txt", "r", encoding="utf-8")
    excludes = file.read().split(",")
    file.close()
    for delWord in excludes:
        try:
            del counts[delWord]
        except KeyError:          # the word may not appear in counts
            continue
    items = list(counts.items())
    items.sort(key=takeSecond, reverse=True)
    for i in range(20):
        item = items[i]
        keyWord = item[0]
        count = item[1]
        print("{0:<10}{1:>5}".format(keyWord, count))

main()

Here’s what the highlighted code means. We put the phrases we want to remove from the high-frequency results into a text file called excludes.txt. These words can be picked out from the output of the run in the previous section.

Read the file, call the split method on the returned string with “,” (comma) as the delimiter, and assign the result to a list called excludes. The elements of the excludes list are now exactly the phrases we wanted to remove from the high-frequency results in the previous section.
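As a minimal sketch of what this step does, suppose excludes.txt contains a single comma-separated line of entries (the entries here are hypothetical; the actual contents depend on the unwanted words you collected in the previous section):

text = "how,that,one,we,what"   # what file.read() might return
excludes = text.split(",")      # split the string on commas into a list
print(excludes)                 # ['how', 'that', 'one', 'we', 'what']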

Next, a new for loop iterates over the elements of the excludes list, assigning each element to the variable delWord. In the loop body, the del statement deletes the key-value pair in the dictionary counts whose key is delWord.

Note that if the dictionary does not contain the key to be deleted, the del statement raises an error. So we wrap it in an exception-handling statement: when an exception occurs, the continue statement in the except clause skips to the next iteration of the loop.
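As a side note, the same effect can be achieved without exception handling by using dict.pop with a default value, which simply does nothing when the key is absent; a minimal sketch:

counts = {"monster": 300, "one": 250}
counts.pop("one", None)       # removes "one" and returns its value
counts.pop("missing", None)   # key absent: returns None, no error raised
print(counts)                 # {'monster': 300}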

Run the program; the resulting word frequency statistics are shown in Figure 1.

Figure 1: word frequency results after removing unwanted words
The unwanted phrases have now been removed from the result. However, the results are still somewhat flawed: “Walker” (Xingzhe), “Great Sage”, “Old Sun” and “Wukong” all refer to Sun Wukong, the Monkey King, so counting them separately is clearly inappropriate. Below, we’ll show how to merge the different names for the same character.

Merge names

From the word frequency statistics above, we can see that the output contains multiple appellations for the same character. Therefore, we will further refine the program to merge the different appellations of each character.

The specific code is shown below, and the new code is highlighted.

import jieba

def takeSecond(elem):
    return elem[1]

def main():
    path = "Journey to the West.txt"
    file = open(path, "r", encoding="utf-8")
    text = file.read()
    file.close()
    words = jieba.lcut(text)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        # Map the different appellations of each character to one name
        elif word in ("Great Sage", "Old Sun", "Walker", "Sun Dasheng",
                      "Sun Xingzhe", "Monkey King", "Wukong", "Monkey"):
            rword = "Sun Wukong"
        elif word in ("Master", "Sanzang", "Saint Monk"):
            rword = "Tang Monk"
        elif word in ("nerd", "Bajie", "old pig"):
            rword = "Pig Bajie"
        elif word == "Monk Sha":
            rword = "Sand Monk"
        elif word in ("goblin", "demon"):
            rword = "monster"
        elif word == "Buddha":
            rword = "Tathagata"
        elif word == "Three Princes":
            rword = "white horse"
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1
    file = open("excludes.txt", "r", encoding="utf-8")
    excludes = file.read().split(",")
    file.close()
    for delWord in excludes:
        try:
            del counts[delWord]
        except KeyError:
            continue
    items = list(counts.items())
    items.sort(key=takeSecond, reverse=True)
    for i in range(20):
        item = items[i]
        keyWord = item[0]
        count = item[1]
        print("{0:<10}{1:>5}".format(keyWord, count))

main()

Let’s take a look at the meaning of the new code. A new variable rword is created and used instead of word as the key of the dictionary counts. Whenever word is one of the several appellations of the same character, we assign that character’s canonical name to rword.

Then rword is used as the key of the dictionary counts, with its number of occurrences as the value. For example, since “Master”, “Sanzang” and “Saint Monk” are all honorific titles for the Tang Monk, we count all of them as “Tang Monk”. The other characters with multiple appellations are handled in the same way.
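As the list of characters grows, the elif chain becomes unwieldy. A common alternative (a sketch of the idea, not the tutorial’s code; the alias table below is abbreviated and hypothetical) is to build a dictionary that maps every appellation to its canonical name and look each word up in it:

# Hypothetical alias table: every appellation maps to one canonical name
ALIASES = {
    "Great Sage": "Sun Wukong", "Old Sun": "Sun Wukong",
    "Walker": "Sun Wukong", "Wukong": "Sun Wukong",
    "Master": "Tang Monk", "Sanzang": "Tang Monk",
    "Bajie": "Pig Bajie", "old pig": "Pig Bajie",
}

counts = {}
for word in ["Wukong", "Sanzang", "Wukong", "river"]:  # stand-in for jieba output
    rword = ALIASES.get(word, word)   # fall back to the word itself
    counts[rword] = counts.get(rword, 0) + 1
print(counts)   # {'Sun Wukong': 2, 'Tang Monk': 1, 'river': 1}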

Run the program again; the resulting word frequency statistics are shown in Figure 2.

Figure 2: word frequency results after merging names
Now we can see which words appear most often in “Journey to the West”. “Sun Wukong” appears 6,639 times and is the undisputed protagonist, followed by “Tang Monk” and “Pig Bajie”. “Monster”, the collective name for the various villains, ranks fourth. “Sand Monk” ranks fifth, making him the least prominent of the four pilgrims. This statistical result is entirely consistent with our usual understanding of “Journey to the West”.

As Figure 2 shows, plain-text output like this is not very intuitive. In the next section, we will introduce in detail how to present the results more intuitively.
