Python jieba library word segmentation
In the previous tutorial, we installed the jieba library. In this tutorial, we explain how to segment text with it.
jieba is an excellent third-party Chinese word segmentation library for Python. It supports three segmentation modes: precise mode, full mode, and search engine mode. Their characteristics are as follows.
- Precise mode: segments the sentence as accurately as possible and produces no redundant words, which makes it suitable for text analysis.
- Full mode: scans out every fragment of the sentence that can form a word. It is very fast, but the output contains redundant words and ambiguity is not resolved.
- Search engine mode: starts from the precise-mode result and splits long words again to improve recall, which makes it suitable for search engine indexing.
You can segment text with the jieba.lcut() and jieba.lcut_for_search() methods, both of which return a list. jieba.lcut() accepts three parameters: the string to segment, cut_all, which selects full mode (default False), and HMM, which enables the HMM model (default True). jieba.lcut_for_search() accepts two parameters: the string to segment and HMM (default True).
Tip: HMM stands for Hidden Markov Model, a probability-based statistical model. Readers only need to recognize it as a technical term.
These concepts may be a bit abstract, so let's see the effect of the three segmentation modes through an example. The code is shown below.
    import jieba

    # Chinese sentence: "Jiangzhou Yangtze River Bridge participated in
    # the opening ceremony of the Yangtze River Bridge"
    segStr = "江州市长江大桥参加了长江大桥的通车仪式"
    joinChar = " / "

    # Precise mode uses the defaults: cut_all=False, HMM=True
    print("Precise mode: " + joinChar.join(jieba.lcut(segStr)))
    print("Full mode: " + joinChar.join(jieba.lcut(segStr, cut_all=True)))
    print("HMM mode not enabled: " + joinChar.join(jieba.lcut(segStr, HMM=False)))
    print("Search engine mode: " + joinChar.join(jieba.lcut_for_search(segStr)))
The results obtained are shown in Figure 3.
When the program runs, jieba first initializes itself and loads its default dictionary. If we want a more comprehensive dictionary, we can replace the default one. The default dictionary is the dict.txt file located in the module's installation path.
We created a variable segStr to hold the string to be segmented. It is a Chinese sentence that reads in English: "Jiangzhou Yangtze River Bridge participated in the opening ceremony of the Yangtze River Bridge".
We also created a variable called joinChar and assigned it a slash surrounded by spaces (" / "). We pass the segmentation results to this string's join() method so that the words are printed with slashes between them.
The join() method joins the elements of a sequence with the specified string to produce a new string. Its syntax is str.join(sequence), where sequence is the sequence of string elements to be concatenated. For example, to concatenate all elements of a tuple into one string with hyphens:
    >>> joinChar = "-"
    >>> seq = ("a", "b", "c")
    >>> print(joinChar.join(seq))
    a-b-c
Next, we segment the string with jieba.lcut(). We pass only the string, so the defaults apply: full mode is off and the HMM model is on. The method therefore returns the precise-mode segmentation as a list. We join the list with slashes using join() and print it. The output is as follows:
Precise mode: Jiangzhou / City / Yangtze River Bridge / Attended / Attended / Yangtze River Bridge / Open to Traffic / Ceremony
As you can see, the resulting words are mostly the ones we use in everyday language.
Next, in addition to the string, we pass cut_all=True to jieba.lcut() to request full-mode segmentation. The method returns the full-mode list, and the output is as follows:
Full mode: Jiangzhou / City / Mayor / Yangtze River / Yangtze River Bridge / Bridge / Participate / Attend / Yangtze River / Yangtze River Bridge / Bridge / Open to traffic / Ceremony
As you can see, full mode emits overlapping words. "Yangtze River Bridge" appears not only as a whole word but also as "Yangtze River" and "Bridge", and "Mayor" appears alongside "City" because the two words share a character in the original sentence. Similarly, a phrase such as "electronic games" would be split into four words: "electronics", "electronic games", "sub-you" and "games". Compared with precise mode, full mode produces many redundant words and leaves the ambiguity unresolved.
Next, we turn off the HMM model and look at the effect.
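The overlapping output of full mode can be imitated with a few lines of plain Python. The sketch below is a simplified illustration of the idea, not jieba's implementation: it assumes a hypothetical toy vocabulary and simply reports every substring of the sentence that appears in it.

```python
def full_mode_sketch(sentence, vocabulary):
    """Return every substring of `sentence` found in `vocabulary`, left to right."""
    max_len = max(len(word) for word in vocabulary)
    found = []
    for start in range(len(sentence)):
        # Only try substrings no longer than the longest vocabulary word.
        last_end = min(start + max_len, len(sentence))
        for end in range(start + 1, last_end + 1):
            candidate = sentence[start:end]
            if candidate in vocabulary:
                found.append(candidate)
    return found


# Hypothetical toy vocabulary: Jiangzhou, City, Mayor, Yangtze River,
# Yangtze River Bridge, Bridge.
toy_vocabulary = {"江州", "市", "市长", "长江", "长江大桥", "大桥"}

# "Jiangzhou Yangtze River Bridge"
full_mode_words = full_mode_sketch("江州市长江大桥", toy_vocabulary)
print(" / ".join(full_mode_words))
```

Because every matching substring is kept, "City" and "Mayor" both appear even though they share a character, which is exactly the redundancy that precise mode removes.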
HMM mode not enabled: Jiangzhou / City / Yangtze River Bridge / Participated / Attended / Yangtze River Bridge / Opening to Traffic / Ceremony
For this string, the results are similar with and without the HMM model; the difference lies mainly in how names are split, because jieba uses the HMM model to recognize words that are absent from its dictionary.
Finally, we use search engine mode by calling jieba.lcut_for_search() instead of jieba.lcut(). We pass the same string, and the segmentation result is as follows.
Search engine mode: Jiangzhou / City / Yangtze River / Bridge / Yangtze River Bridge / Participate / Attended / Yangtze River / Bridge / Yangtze River Bridge / Opening to Traffic / Ceremony
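The way search engine mode re-splits long words can also be sketched in plain Python. This is a simplified illustration loosely modeled on the behavior described above, not jieba's actual code: it assumes the precise-mode words are already available and uses a hypothetical toy vocabulary to look up the shorter words inside each long word.

```python
def search_mode_sketch(words, vocabulary):
    """Emit shorter vocabulary words found inside each long word, then the word itself."""
    result = []
    for word in words:
        # Try two-character fragments for words longer than two characters,
        # and three-character fragments for words longer than three.
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    fragment = word[i:i + size]
                    if fragment in vocabulary:
                        result.append(fragment)
        result.append(word)
    return result


# Hypothetical toy vocabulary: Yangtze River, Bridge, Yangtze River Bridge.
toy_vocabulary = {"长江", "大桥", "长江大桥"}

# Assumed precise-mode words: Jiangzhou, City, Yangtze River Bridge.
precise_words = ["江州", "市", "长江大桥"]
search_words = search_mode_sketch(precise_words, toy_vocabulary)
print(" / ".join(search_words))
```

Short words pass through unchanged, while the long word is preceded by the shorter words it contains, which matches the order "Yangtze River / Bridge / Yangtze River Bridge" in the output above.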