(1) pyahocorasick and marisa_trie: Python packages for fast string lookup, efficient enough for natural language processing and named entity recognition

Pyahocorasick

Installing Pyahocorasick

Pyahocorasick can be installed with pip:

pip install pyahocorasick

Using Pyahocorasick

The following example uses Pyahocorasick to find every occurrence of a set of patterns in a text:

import ahocorasick

# Build the pattern-matching automaton
patterns = ['he', 'she', 'his', 'hers']
automaton = ahocorasick.Automaton()
for pattern in patterns:
    # Store the pattern itself as the value so iter() can return it
    automaton.add_word(pattern, pattern)
automaton.make_automaton()

# Find matches in the text
text = 'ushershewashis'
matches = []
for end_index, matched_pattern in automaton.iter(text):
    # iter() reports the index of the last matched character (inclusive)
    start_index = end_index - len(matched_pattern) + 1
    matches.append((matched_pattern, start_index, end_index))
print(matches)

Output:

[('she', 1, 3), ('he', 2, 3), ('hers', 2, 5), ('she', 5, 7), ('he', 6, 7), ('his', 11, 13)]
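
For dictionary-based entity tagging, overlapping hits such as 'she' and 'he' ending at the same position are often reduced to one span per region. The following is a minimal sketch of one possible policy: the leftmost match wins, and among matches starting at the same position the longest wins. The helper name leftmost_longest is hypothetical and is not part of pyahocorasick; it only post-processes (pattern, start, end) tuples with inclusive end indices, like those printed above.

```python
def leftmost_longest(matches):
    """Reduce overlapping matches to non-overlapping spans.

    `matches` holds (pattern, start, end) tuples with inclusive end indices.
    Policy (an assumption, one of several reasonable choices):
    earlier start wins; on equal start, the longer match wins.
    """
    # Sort by start position; for equal starts, put the longer match first.
    ordered = sorted(matches, key=lambda m: (m[1], -(m[2] - m[1])))
    kept = []
    last_end = -1  # inclusive end of the last span we kept
    for pattern, start, end in ordered:
        if start > last_end:  # no overlap with the previously kept span
            kept.append((pattern, start, end))
            last_end = end
    return kept


matches = [('she', 1, 3), ('he', 2, 3), ('hers', 2, 5),
           ('she', 5, 7), ('he', 6, 7), ('his', 11, 13)]
print(leftmost_longest(matches))
```

Under this policy 'hers' is dropped because it overlaps the earlier 'she'; a different policy (for example, preferring the longest span overall) would keep it instead.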

Marisa_trie

Installing Marisa_trie

Marisa_trie can be installed with pip:

pip install marisa-trie

Using Marisa_trie

import marisa_trie

# Build the trie
short_strings = ['hello', 'world', 'python', 'py']
trie = marisa_trie.Trie(short_strings)

# Match against the long string
long_string = 'this is a hello world example using python hello'

results = []
for i in range(len(long_string)):
    # Keys in the trie that are prefixes of the text starting at position i
    matches = trie.prefixes(long_string[i:])

    # Record each match with its start and (exclusive) end position
    for match in matches:
        results.append((match, i, i + len(match)))

print(results)

Output:

[('hello', 10, 15), ('world', 16, 21), ('py', 36, 38), ('python', 36, 42), ('hello', 43, 48)]

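In the example above, we first build a trie from several short strings. We then step through every start position in the text and ask the trie which of its keys are prefixes of the text from that position onward. For each hit, we compute the start and end positions and append them to the result list. Note that the end position here is exclusive (i + len(match)), whereas the Pyahocorasick example reports the index of the last matched character, so its end index is inclusive.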
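
To make the scan described above concrete, here is a minimal pure-Python sketch of the same idea using a nested-dict trie. It is only an illustration of the technique, not how marisa_trie is implemented; the names END, build_trie, prefixes, and find_all are hypothetical helpers introduced for this sketch.

```python
END = ""  # sentinel key marking "a stored word ends here" (never a real character)


def build_trie(words):
    """Build a nested-dict trie from an iterable of strings."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def prefixes(trie, text):
    """Return every stored word that is a prefix of `text`, shortest first."""
    found = []
    node = trie
    for i, ch in enumerate(text):
        node = node.get(ch)
        if node is None:  # no stored word continues with this character
            break
        if END in node:   # a stored word ends exactly here
            found.append(text[:i + 1])
    return found


def find_all(trie, text):
    """Scan every start position; return (word, start, exclusive_end) tuples."""
    results = []
    for start in range(len(text)):
        for word in prefixes(trie, text[start:]):
            results.append((word, start, start + len(word)))
    return results


trie = build_trie(['hello', 'world', 'python', 'py'])
results = find_all(trie, 'this is a hello world example using python hello')
print(results)
```

Each start position costs at most the length of the longest stored word, but the scan restarts from the trie root every time; an Aho-Corasick automaton avoids those restarts by processing the text in a single pass.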
