
Python jieba word segmentation example: word segmentation of “Journey to the West” (1)


In this tutorial, we will learn how to use the jieba module to segment the classic novel “Journey to the West”, graphically display the number of appearances of key characters, and further create a word cloud.

Read the file

Because the novel “Journey to the West” is very long, it is impractical to type its content into a string literal, so we save it in a file and then read the file’s contents. The process of operating on a file is as follows:

1. Open the file, get the file handle and assign it to a variable;

2. Operate on the file through the handle;

3. Close the file.

To open a file, use the open() function. We introduced how to open files earlier; if you have forgotten, you can review what you have learned.
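As a quick refresher, here is a minimal sketch of the three steps above. The file name sample.txt and its contents are made up for illustration; the sketch first creates the file so it is self-contained.

```python
# Create a small file first so the example is self-contained.
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write("Hello, Journey to the West!")

# 1. Open the file and assign the file handle to a variable.
file = open("sample.txt", "r", encoding="utf-8")
# 2. Operate on the file through the handle.
text = file.read()
# 3. Close the file.
file.close()

print(text)  # → Hello, Journey to the West!
```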

Word segmentation of “Journey to the West”

Earlier, we introduced how to use the jieba library for word segmentation and how to open a text file. Next, we will segment the classic novel “Journey to the West” and display the words that appear most frequently.

First, save “Journey to the West” to a text file. Note that when saving the file, the encoding should be UTF-8; otherwise an error will be reported when reading the file. We also put this text file in the same folder as the program code, so we do not need to specify a path.

Let’s first look at the program code used for word segmentation; the code is as follows.

import jieba

def takeSecond(elem):
    return elem[1]

def main():
    path = "Journey to the West.txt"
    file = open(path, "r", encoding="utf-8")
    text = file.read()
    file.close()
    words = jieba.lcut(text)
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=takeSecond, reverse=True)
    for i in range(20):
        item = items[i]
        keyWord = item[0]
        count = item[1]
        print("{0:<10}{1:>5}".format(keyWord, count))

main()

Because we want to use jieba’s functions, we first need to import the jieba module.

import jieba

Next, we define two functions: main (the main function) and takeSecond (which returns the second element of a sequence). The variable path holds the relative path of the text file. We use the open() function to open the text file “Journey to the West.txt” in read mode with the encoding UTF-8, and assign the file handle to the variable file. We then call the read() method to read the contents of the file into the variable text, and call the close() method to close the file.

Then we use the jieba.lcut() method to segment the contents of the variable text, and save the resulting list of words in the variable words. We create an empty dictionary called counts, and use a for loop to iterate over each element of the list words, with the variable word representing each element.

In the loop, word is used as a key of the dictionary counts, and the value stored for that key is the result of the get() method plus 1. This means that every time the same key is encountered, its count is incremented by 1 (counting the occurrences of identical keys). Note that if the key is not yet in the dictionary, get() returns the default value 0 that we supplied. When the loop finishes, the dictionary counts maps every word segmented from “Journey to the West” to the number of times it appears.
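This counting pattern can be seen in isolation with a short, made-up word list standing in for jieba’s output:

```python
# Hypothetical word list standing in for jieba's segmentation result.
words = ["monkey", "king", "monkey", "monk", "monkey"]

counts = {}
for word in words:
    # get() returns the current count, or 0 if the word is not yet a key.
    counts[word] = counts.get(word, 0) + 1

print(counts)  # → {'monkey': 3, 'king': 1, 'monk': 1}
```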

Next, we want to sort the words by their number of occurrences. When we introduced dictionaries, we mentioned that a dictionary itself cannot be sorted. We need to convert the dictionary into a list, and then use the list’s sort() method. Because we sort by the count rather than the word, we pass a key argument to sort() to specify which element of each item should be compared.

The custom takeSecond() function serves as that key. It accepts one item (a word–count pair) and returns its second element, the count. With reverse=True, sort() rearranges the list items in place in descending order of count.
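The sorting step can be tried on its own with a few made-up (word, count) pairs:

```python
def takeSecond(elem):
    return elem[1]

# Made-up (word, count) pairs for illustration.
items = [("monk", 2), ("monkey", 5), ("king", 3)]
items.sort(key=takeSecond, reverse=True)  # sorts the list in place

print(items)  # → [('monkey', 5), ('king', 3), ('monk', 2)]
```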

Then, we use the range() function to loop over the indices of the first 20 elements of items. In each iteration, we assign the current element to the variable item, its first element (the word) to the variable keyWord, and its second element (the count) to the variable count.

Then we use the print() function to output the two formatted variables to the screen. Here the format() method formats the string as required: the value of keyWord is left-aligned in a width of 10, and the value of count is right-aligned in a width of 5.

The format() method of strings identifies the content to be replaced by the curly braces {} in the string; the arguments of format() are the contents to fill in, matched in order. After the colon inside the braces, the < symbol indicates left alignment, the > symbol indicates right alignment, and the number indicates the field width.
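For example, with the same format string as in the program (the values "monkey" and 42 are made up):

```python
# Left-align "monkey" in a field of width 10, right-align 42 in width 5.
line = "{0:<10}{1:>5}".format("monkey", 42)
print(repr(line))  # → 'monkey       42'
```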

Then call the main() function. The final word frequency statistics results are shown in Figure 1.


Figure 1
Looking at the output, we find a problem: most of the words are single characters. That is, words of length 1 have not been filtered out. Next, we continue to optimize this program.

Filter words of length 1

Because we did not filter the segmentation results, most of the top 20 high-frequency words are single characters, which is obviously not the result we want. Single-character words carry little meaning on their own, and we need meaningful words, so next we introduce how to filter out words with a length of 1.

To make the change easy to spot, the newly added code is the if/continue check inside the for loop. The full code is as follows.

import jieba

def takeSecond(elem):
    return elem[1]

def main():
    path = "Journey to the West.txt"
    file = open(path, "r", encoding="utf-8")
    text = file.read()
    file.close()
    words = jieba.lcut(text)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        else:
            counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=takeSecond, reverse=True)
    for i in range(20):
        item = items[i]
        keyWord = item[0]
        count = item[1]
        print("{0:<10}{1:>5}".format(keyWord, count))

main()

Let’s look at the meaning of the new code. When traversing each element of the list words in the for loop, a condition is added to the loop body: if the length of the current word equals 1, continue skips directly to the next iteration; otherwise, the word’s occurrences are counted as before.

Running the program, the word frequency statistics results obtained are shown in Figure 2.

Figure 2
The words obtained now are all multi-character words, and this result is more meaningful than single characters. However, it is still not good enough, because words such as “a”, “there” and “how” do not help us understand “Journey to the West”. In the next section, we will introduce how to remove such unwanted words.
