Python re.split() vs nltk word_tokenize and sent_tokenize

lov*_*sus 13 python regex nlp tokenize nltk

I was going through this question.

I am just wondering whether NLTK is faster than regex for word/sentence tokenization.

alv*_*vas 24

By default, nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer from the Penn Treebank.

Note that str.split() does not produce tokens in the linguistic sense, e.g.:

>>> sent = "This is a foo, bar sentence."
>>> sent.split()
['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
>>> from nltk import word_tokenize
>>> word_tokenize(sent)
['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']

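str.split() is usually used to split a string on a specified delimiter, e.g. str.split('\t') for a tab-separated file, or splitting on the newline \n when your text file has one sentence per line.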
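To make the contrast concrete, here is a minimal sketch of delimiter-based splitting next to a crude regex tokenizer. The sample data and the pattern `\w+|[^\w\s]` are assumptions chosen for illustration; the pattern is not what NLTK uses internally.

```python
import re

# One record per line, fields separated by tabs (made-up sample data).
tsv = "id\tsentence\n1\tThis is a foo, bar sentence.\n"
rows = [line.split('\t') for line in tsv.splitlines()]

# A crude regex tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def regex_tokenize(text):
    """Return word and punctuation tokens found by TOKEN_RE."""
    return TOKEN_RE.findall(text)

tokens = regex_tokenize(rows[1][1])
print(rows)
print(tokens)
```

On this one sentence the regex yields the same tokens as word_tokenize(), but it has no special handling for contractions, abbreviations or quotes, so the two will diverge on real text.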

Let's do some benchmarking in python3:

import time
from nltk import word_tokenize

import urllib.request
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        line.split()
    print ('str.split():\t', time.time() - start)

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        word_tokenize(line)
    print ('word_tokenize():\t', time.time() - start)

[OUT]:

str.split():     0.05451083183288574
str.split():     0.054320573806762695
str.split():     0.05368804931640625
str.split():     0.05416440963745117
str.split():     0.05299568176269531
str.split():     0.05304527282714844
str.split():     0.05356955528259277
str.split():     0.05473494529724121
str.split():     0.053118228912353516
str.split():     0.05236077308654785
word_tokenize():     4.056122779846191
word_tokenize():     4.052812337875366
word_tokenize():     4.042144775390625
word_tokenize():     4.101543664932251
word_tokenize():     4.213029146194458
word_tokenize():     4.411528587341309
word_tokenize():     4.162556886672974
word_tokenize():     4.225975036621094
word_tokenize():     4.22914719581604
word_tokenize():     4.203172445297241
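Since the question asks about regex specifically, a similar comparison can be sketched with the standard-library timeit module, which uses a high-resolution clock. This is only a sketch: the corpus below is a synthetic stand-in rather than the downloaded test.txt, and the regex is the same illustrative assumption as before, so the absolute numbers are not comparable to the ones above.

```python
import re
import timeit

# Synthetic stand-in corpus (an assumption; the benchmark above used test.txt).
lines = ["This is a foo, bar sentence."] * 10000

# Illustrative pattern, pre-compiled once outside the timed loop.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def split_all():
    for line in lines:
        line.split()

def regex_all():
    for line in lines:
        TOKEN_RE.findall(line)

results = {}
for name, func in [("str.split()", split_all), ("re.findall()", regex_all)]:
    # Five independent passes over the corpus; the minimum is the least noisy.
    results[name] = min(timeit.repeat(func, number=1, repeat=5))
    print(f"{name}:\t{results[name]:.4f}s")
```

Taking the minimum of several repeats reduces the influence of other processes on the measurement.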

If we try another word tokenizer from the bleeding-edge NLTK, ported from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl:

import time
from nltk.tokenize import ToktokTokenizer

import urllib.request
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

toktok = ToktokTokenizer().tokenize

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        toktok(line)
    print ('toktok:\t', time.time() - start)

[OUT]:

toktok:  1.5902607440948486
toktok:  1.5347232818603516
toktok:  1.4993178844451904
toktok:  1.5635688304901123
toktok:  1.5779635906219482
toktok:  1.8177132606506348
toktok:  1.4538452625274658
toktok:  1.5094449520111084
toktok:  1.4871931076049805
toktok:  1.4584410190582275

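(Note: the source of the text file is https://github.com/Simdiva/DSL-Task)


If we look at the native Perl implementation, the Python and Perl timings for ToktokTokenizer are comparable. In the Python implementation the regexes are pre-compiled, whereas in the Perl script they are not; still, the proof is in the pudding: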

alvas@ubi:~$ wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
--2016-02-11 20:36:36--  https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2690 (2.6K) [text/plain]
Saving to: ‘tok-tok.pl’

100%[===============================================================================================================================>] 2,690       --.-K/s   in 0s      

2016-02-11 20:36:36 (259 MB/s) - ‘tok-tok.pl’ saved [2690/2690]

alvas@ubi:~$ wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
--2016-02-11 20:36:38--  https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3483550 (3.3M) [text/plain]
Saving to: ‘test.txt’

100%[===============================================================================================================================>] 3,483,550    363KB/s   in 7.4s   

2016-02-11 20:36:46 (459 KB/s) - ‘test.txt’ saved [3483550/3483550]

alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.703s
user    0m1.693s
sys 0m0.008s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.715s
user    0m1.704s
sys 0m0.008s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.700s
user    0m1.686s
sys 0m0.012s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.727s
user    0m1.700s
sys 0m0.024s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.734s
user    0m1.724s
sys 0m0.008s

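(Note: when timing tok-tok.pl we had to redirect the output to a file, so the timing here includes the time the machine takes to write the output to a file, whereas the nltk.tokenize.ToktokTokenizer timing does not include any file output.)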
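The remark about pre-compiled regexes can be illustrated with a small substitution-style tokenizer: compile a list of patterns once, apply them in order, then split on whitespace. The rules below are hypothetical and are not the actual tok-tok or ToktokTokenizer rules.

```python
import re

# Illustrative rules only; these are NOT the real tok-tok rules.
# Each pattern is compiled once at import time, not on every call.
RULES = [
    (re.compile(r'([,;:!?()"])'), r" \1 "),  # pad common punctuation with spaces
    (re.compile(r"\.\s*$"), " ."),           # split off a line-final period
]

def substitution_tokenize(text):
    """Apply each pre-compiled rule in order, then split on whitespace."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.split()

demo_tokens = substitution_tokenize("This is a foo, bar sentence.")
print(demo_tokens)
```

Keeping the compiled patterns in a module-level list means the compilation cost is paid once rather than per line.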


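With regard to sent_tokenize(), it's a little different, and comparing speed benchmarks without considering accuracy is a little quirky.

Consider this: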

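  • If a regex splits a text file/paragraph into 1 sentence, then the speed is almost instantaneous, i.e. 0 work done. But that would be a horrible sentence tokenizer...

  • If the sentences in a file are already separated by \n, then it is simply a case of comparing str.split('\n') vs re.split('\n'), and nltk has nothing to do with the sentence tokenization ;P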
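Both cases can be sketched in a few lines. The function name one_sentence_splitter and the sample string are hypothetical, introduced only for illustration.

```python
import re

def one_sentence_splitter(text):
    """Degenerate 'tokenizer': does no work and returns the input as one sentence."""
    return [text]

doc = "First sentence. Second sentence.\nThird sentence."

# Case 1: instantaneous, but useless as a sentence tokenizer.
whole = one_sentence_splitter(doc)

# Case 2: with newline-delimited input, both approaches give the same pieces.
by_str = doc.split('\n')
by_re = re.split('\n', doc)
print(whole, by_str, by_re)
```

Neither case says anything about how well a tokenizer finds sentence boundaries inside a line.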

For information on how sent_tokenize() works in NLTK, see:

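So to compare sent_tokenize() effectively against other regex-based methods (not str.split('\n')), one would also have to evaluate accuracy, using a dataset of human-evaluated sentences in a tokenized format.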
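One simple way to score a sentence splitter against such a dataset is exact-match precision, recall and F1 over sentence strings. This is a sketch of one possible metric, not an established NLTK evaluation; the function name and toy data are assumptions.

```python
def sentence_scores(predicted, gold):
    """Exact-match precision, recall and F1 over sets of sentence strings."""
    predicted_set, gold_set = set(predicted), set(gold)
    correct = len(predicted_set & gold_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if correct else 0.0
    return precision, recall, f1

# Toy data: the splitter found one gold sentence and merged the other two.
gold = ["Hello there.", "How are you?", "Fine."]
predicted = ["Hello there.", "How are you? Fine."]
scores = sentence_scores(predicted, gold)
print(scores)
```

Exact matching is strict: a prediction that merges two gold sentences earns no credit for either.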

Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences

Given the text:

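In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.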

We want to get this:

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
Such were Willarski and even the Grand Master of the principal lodge.
Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
Pierre began to feel dissatisfied with what he was doing.
Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
What is to be done in these circumstances?
To favor revolutions, overthrow everything, repel force by force?
No!
We are very far from that.
Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
"But what is there in running across it like that?" said Ilagin's groom.
"Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.

So simply doing str.split('\n') will give you nothing. Even without considering the order of the sentences, you get 0 correct results:

>>> text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """
>>> answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
... Such were Willarski and even the Grand Master of the principal lodge.
... Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
... These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
... Pierre began to feel dissatisfied with what he was doing.
... Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
... He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
... And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
... What is to be done in these circumstances?
... To favor revolutions, overthrow everything, repel force by force?
... No!
... We are very far from that.
... Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
... "But what is there in running across it like that?" said Ilagin's groom.
... "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement."""
>>> 
>>> output = text.split('\n')
>>> sum(1 for sent in text.split('\n') if sent in answer)
0
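For comparison, a naive regex splitter recovers some, but not all, of the expected sentences. The sketch below runs on a short excerpt of the text above and assumes a hypothetical pattern that splits on whitespace following ., ! or ?; it is not how sent_tokenize() works.

```python
import re

# Short excerpt of the task text, including its missing space after "force?".
text = ('What is to be done in these circumstances? '
        'To favor revolutions, overthrow everything, repel force by force?No! '
        'We are very far from that. '
        '"But what is there in running across it like that?" said Ilagin\'s groom.')

gold = [
    'What is to be done in these circumstances?',
    'To favor revolutions, overthrow everything, repel force by force?',
    'No!',
    'We are very far from that.',
    '"But what is there in running across it like that?" said Ilagin\'s groom.',
]

# Hypothetical rule: a boundary is whitespace preceded by ., ! or ?
SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

pieces = SENT_BOUNDARY.split(text)
correct = sum(1 for sent in pieces if sent in gold)
print(len(pieces), correct)
```

The pattern finds 3 of the 5 expected sentences here: it cannot separate "force?No!" because no whitespace follows the question mark, which is the kind of case that makes accuracy matter as much as speed.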