Fastest way to strip punctuation from a unicode string in Python

Mic*_*ael 20 python regex unicode python-2.7

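I am trying to efficiently strip punctuation from a unicode string. With a regular string, mystring.translate(None, string.punctuation) is clearly the fastest approach. However, this code breaks on unicode strings in Python 2.7. As the comments on this answer explain, the translate method can still be used, but it has to be given a dictionary. When I use that implementation, though, I find that the performance of translate drops dramatically. Here is my timing code (copied mostly from this answer):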

import re, string, timeit
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = "For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."
su = u"For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."


exclude = set(string.punctuation)
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(None, string.punctuation)

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))

def test_trans_unicode(su):
    return su.translate(tbl)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

print "sets (unicode)      :",timeit.Timer('f(su)', 'from __main__ import su,test_set as f').timeit(1000000)
print "regex (unicode)     :",timeit.Timer('f(su)', 'from __main__ import su,test_re as f').timeit(1000000)
print "translate (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_trans_unicode as f').timeit(1000000)
print "replace (unicode)   :",timeit.Timer('f(su)', 'from __main__ import su,test_repl as f').timeit(1000000)

As my results show, the unicode implementation of translate performs horribly:

sets      : 38.323941946
regex     : 6.7729549408
translate : 1.27428412437
replace   : 5.54967689514

sets (unicode)      : 43.6268708706
regex (unicode)     : 7.32343912125
translate (unicode) : 54.0041439533
replace (unicode)   : 17.4450061321

My question is whether there is a faster way to implement translate for unicode (or any other method) that outperforms regex.

ekh*_*oro 6

The current test script is flawed, because it does not compare like with like.

For a fairer comparison, all the functions must be run with the same set of punctuation characters (i.e. either all ascii, or all unicode).
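To see why this matters, here is a minimal sketch (not part of the original scripts) that builds both punctuation sets and runs the same "set" filter with each. The names ascii_punct, unicode_punct, strip_with and sample are made up for illustration, and the unichr shim is only there so the snippet also runs on Python 3.

```python
import string
import sys
import unicodedata

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3: chr() already returns a unicode character

# The 32 ASCII punctuation characters used by most of the original tests.
ascii_punct = set(string.punctuation)

# Every code point whose Unicode general category starts with 'P',
# which is what the dictionary passed to unicode.translate was built from.
unicode_punct = set(
    unichr(i) for i in range(sys.maxunicode + 1)
    if unicodedata.category(unichr(i)).startswith('P')
)

def strip_with(text, exclude):
    # The "set" method from the thread, parameterised on the exclusion set.
    return u''.join([ch for ch in text if ch not in exclude])

# U+2019 is the curly apostrophe, U+2014 the long dash from the sample text.
sample = u"Obi Wan\u2019s cantina: scum\u2014villainy."

ascii_only = strip_with(sample, ascii_punct)      # U+2019 and U+2014 survive
full_unicode = strip_with(sample, unicode_punct)  # they are removed as well

print("%d ascii vs %d unicode punctuation characters"
      % (len(ascii_punct), len(unicode_punct)))
```

The two sets are not even nested: characters such as $, + and ~ are in string.punctuation but belong to Unicode symbol categories (S*), so a table built from category P leaves them alone. Functions timed against different sets are therefore doing different amounts of work.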

When that is done, the regex and replace methods fare much worse with the full set of unicode punctuation characters.

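For full unicode, it looks like the "set" method is the best. However, if you only want to remove the ascii punctuation characters from unicode strings, it may be best to encode, translate, and decode (depending on the length of the input string).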
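The encode, translate, decode round trip can be sketched as follows. This is an illustrative version rather than the exact function that was timed: strip_ascii_punct and ASCII_PUNCT_BYTES are hypothetical names, and the punctuation is encoded to bytes up front so that the same code runs on Python 2.7 and Python 3.

```python
import string

# ASCII punctuation as a byte string (a no-op conversion on Python 2).
ASCII_PUNCT_BYTES = string.punctuation.encode('ascii')

def strip_ascii_punct(text):
    # In UTF-8 every byte of a multi-byte sequence is >= 0x80, so deleting
    # ASCII bytes can never corrupt a non-ASCII character.
    raw = text.encode('utf-8').translate(None, ASCII_PUNCT_BYTES)
    return raw.decode('utf-8')

# The colon, comma and full stop are removed; the curly apostrophe
# (U+2019) is not ASCII punctuation, so it is left in place.
result = strip_ascii_punct(u"Obi Wan\u2019s cantina: scum, villainy.")
```

Keep in mind that this removes ascii punctuation only. That is also true of test_enc_trans in the re-hashed script, which always deletes string.punctuation, even in the unicode punctuation run.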

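The "replace" method can also be improved substantially by doing a containment test before attempting each replacement (depending on the precise make-up of the string).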
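As a rough sketch of that idea, the two variants differ only in the `if c in text` guard. The function names here are illustrative, and punct defaults to the ascii set purely to keep the example short.

```python
import string

def strip_replace(text, punct=string.punctuation):
    # Plain "replace" method: one replace() call per punctuation character,
    # whether or not that character occurs in the text.
    for c in punct:
        text = text.replace(c, u"")
    return text

def strip_in_replace(text, punct=string.punctuation):
    # Same loop, but replace() is only called for characters that are
    # actually present in the text.
    for c in punct:
        if c in text:
            text = text.replace(c, u"")
    return text

cleaned = strip_in_replace(u"But, you know: it isn't so bad.")
```

Both functions return the same result. Whether the extra test pays off depends on how many of the punctuation characters actually occur in the input: in the timings below it helps with unicode strings and is slightly slower with byte strings.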

Here are some sample results from a re-hash of the test script:

$ python2 test.py
running ascii punctuation test...
using byte strings...

set: 0.862006902695
re: 0.17484498024
trans: 0.0207080841064
enc_trans: 0.0206489562988
repl: 0.157525062561
in_repl: 0.213351011276

$ python2 test.py a
running ascii punctuation test...
using unicode strings...

set: 0.927773952484
re: 0.18892288208
trans: 1.58275294304
enc_trans: 0.0794939994812
repl: 0.413739919662
in_repl: 0.249747991562

$ python2 test.py u
running unicode punctuation test...
using unicode strings...

set: 0.978360176086
re: 7.97941994667
trans: 1.72471117973
enc_trans: 0.0784001350403
repl: 7.05612301826
in_repl: 3.66821289062

Here is the re-hashed script:

# -*- coding: utf-8 -*-

import re, string, timeit
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = """For me, Reddit brings to mind Obi Wan’s enduring description of the Mos
Eisley cantina: a wretched hive of scum and villainy. But, you know, one you
still kinda want to hang out in occasionally. The thing is, though, Reddit
isn’t some obscure dive bar in a remote corner of the universe—it’s a huge
watering hole at the very center of it. The site had some 400 million unique
visitors in 2012. They can’t all be Greedos. So maybe my problem is just that
I’ve never been able to find the places where the decent people hang out."""

su = u"""For me, Reddit brings to mind Obi Wan’s enduring description of the
Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one
you still kinda want to hang out in occasionally. The thing is, though,
Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a
huge watering hole at the very center of it. The site had some 400 million
unique visitors in 2012. They can’t all be Greedos. So maybe my problem is
just that I’ve never been able to find the places where the decent people
hang out."""

def test_trans(s):
    return s.translate(tbl)

def test_enc_trans(s):
    s = s.encode('utf-8').translate(None, string.punctuation)
    return s.decode('utf-8')

def test_set(s): # with list comprehension fix
    return ''.join([ch for ch in s if ch not in exclude])

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_repl(s):  # From S.Lott's solution
    for c in punc:
        s = s.replace(c, "")
    return s

def test_in_repl(s):  # From S.Lott's solution, with fix
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

txt = 'su'
ptn = u'[%s]'

if 'u' in sys.argv[1:]:
    print 'running unicode punctuation test...'
    print 'using unicode strings...'
    punc = u''
    tbl = {}
    for i in xrange(sys.maxunicode):
        char = unichr(i)
        if unicodedata.category(char).startswith('P'):
            tbl[i] = None
            punc += char
else:
    print 'running ascii punctuation test...'
    punc = string.punctuation
    if 'a' in sys.argv[1:]:
        print 'using unicode strings...'
        punc = punc.decode()
        tbl = {ord(ch):None for ch in punc}
    else:
        print 'using byte strings...'
        txt = 's'
        ptn = '[%s]'
        def test_trans(s):
            return s.translate(None, punc)
        test_enc_trans = test_trans

exclude = set(punc)
regex = re.compile(ptn % re.escape(punc))

def time_func(func, n=10000):
    timer = timeit.Timer(
        'func(%s)' % txt,
        'from __main__ import %s, test_%s  as func' % (txt, func))
    print '%s: %s' % (func, timer.timeit(n))

print
time_func('set')
time_func('re')
time_func('trans')
time_func('enc_trans')
time_func('repl')
time_func('in_repl')