Fast transliteration of Arabic text with Python

Sab*_*bba 5 python arabic

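I work with Arabic text files all the time, and to avoid encoding problems I transliterate the Arabic characters into English according to Buckwalter's scheme (http://www.qamus.org/transliteration.htm).

Here is my code, but it is very slow, even for small files of around 400 KB. How can I make it faster?

Thanks.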

    def transliterate(file):
        data = open(file).read()
        buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}
        # The full table is re-applied once for every character in the file.
        for char in data:
            # k is the Buckwalter symbol, v the Arabic character it stands for.
            for k, v in buckArab.items():
                data = data.replace(v, k)
        return data

Bak*_*riu 5

Whenever you have to transliterate, str.translate is the method to use:

>>> import timeit
>>> buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}
>>> def repl(data, table):
...     for k,v in table.iteritems():
...         data = data.replace(k, v)
...     return data
... 
>>> def trans(data, table):
...     return data.translate(table)
... 
>>> T = u'This is a test to see how fast is translitteration'
>>> timeit.timeit('trans(T, buckArab)', 'from __main__ import trans, T, buckArab', number=10**6)
6.766200065612793
>>> T = 'This is a test to see how fast is translitteration' #in python2 requires ASCII string
>>> timeit.timeit('repl(T, buckArab)', 'from __main__ import repl, T, buckArab', number=10**6)
12.668706893920898

As you can see, even for small strings str.translate is about 2 times faster.
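
The timing above is from Python 2. As a rough Python 3 sketch (the three-entry table and the names `buck_to_arabic` and `table` are illustrative assumptions, not part of the original answer): `str.translate` looks characters up by code point, so a dictionary keyed by one-character strings has to go through `str.maketrans` first. Passed directly, such a dictionary leaves the text unchanged.

```python
# Hypothetical three-entry subset of the Buckwalter table, for illustration only.
buck_to_arabic = {"b": "\u0628", "A": "\u0627", "t": "\u062a"}

# str.maketrans converts the one-character keys into the ordinals
# that str.translate uses for its lookups.
table = str.maketrans(buck_to_arabic)


def trans(data, table):
    """Transliterate data in a single pass using a prepared table."""
    return data.translate(table)


# "bAb" is the Buckwalter spelling of three Arabic letters.
result = trans("bAb", table)

# A table keyed by strings is never matched, so nothing is replaced.
unchanged = "bAb".translate(buck_to_arabic)
```

Building the table once and reusing it for every call keeps the per-string cost to a single pass over the text.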


lar*_*dia 5

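By the way, someone has already written a script that does this, so you may want to check it out before spending too much time on your own: buckwalter2unicode.py

It probably does more than you need, but you don't have to use all of it: I copied just the two dictionaries and the transliterateString function (with a few tweaks, I think) and use it on my site.

Edit: The script above is what I had been using, but I just discovered that it is much slower than using replace, especially on a large corpus. This is the code I ended up with, which seems both simpler and faster (it references a dictionary buck2uni):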

def transString(string, reverse=0):
    '''Given a Unicode string, transliterate into Buckwalter. To go from
    Buckwalter back to Unicode, set reverse=1'''

    for k, v in buck2uni.items():
        if not reverse:
            string = string.replace(v, k)
        else:
            string = string.replace(k, v)

    return string
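
For comparison, the same two-way interface could be sketched in Python 3 with translation tables built once up front, so that each call makes a single pass over the text. The four-entry `buck2uni` here is a hypothetical stand-in for the full dictionary, and `trans_string`, `_TO_ARABIC`, and `_TO_BUCKWALTER` are illustrative names.

```python
# Hypothetical subset of buck2uni (Buckwalter symbol -> Arabic letter).
buck2uni = {"k": "\u0643", "t": "\u062a", "A": "\u0627", "b": "\u0628"}

# Build one table per direction, once, instead of looping over the
# dictionary on every call.
_TO_ARABIC = str.maketrans(buck2uni)
_TO_BUCKWALTER = str.maketrans({v: k for k, v in buck2uni.items()})


def trans_string(string, reverse=False):
    """Transliterate Arabic into Buckwalter.

    Set reverse=True to go from Buckwalter back to Arabic.
    """
    return string.translate(_TO_ARABIC if reverse else _TO_BUCKWALTER)


buckwalter = trans_string("\u0643\u062a\u0627\u0628")
arabic = trans_string("ktAb", reverse=True)
```

Inverting the dictionary this way assumes that no two Buckwalter symbols map to the same Arabic character.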


Ryn*_*ett 5

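Whenever I use str.translate on a unicode object, it returns exactly the same object. Perhaps this is due to the behavior change that Martijn Peters mentioned.

If anyone else is struggling to transliterate unicode (e.g. Arabic) to ascii, I found that mapping ordinals to unicode literals works well.
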
>>> buckArab = {"'":"\xd8\xa1", "|":"\xd8\xa2", "?":"\xd8\xa3", "&":"\xd8\xa4", "<":"\xd8\xa5", "}":"\xd8\xa6", "A":"\xd8\xa7", "b":"\xd8\xa8", "p":"\xd8\xa9", "t":"\xd8\xaa", "v":"\xd8\xab", "g":"\xd8\xac", "H":"\xd8\xad", "x":"\xd8\xae", "d":"\xd8\xaf", "*":"\xd8\xb0", "r":"\xd8\xb1", "z":"\xd8\xb2", "s":"\xd8\xb3", "$":"\xd8\xb4", "S":"\xd8\xb5", "D":"\xd8\xb6", "T":"\xd8\xb7", "Z":"\xd8\xb8", "E":"\xd8\xb9", "G":"\xd8\xba", "_":"\xd9\x80", "f":"\xd9\x81", "q":"\xd9\x82", "k":"\xd9\x83", "l":"\xd9\x84", "m":"\xd9\x85", "n":"\xd9\x86", "h":"\xd9\x87", "w":"\xd9\x88", "Y":"\xd9\x89", "y":"\xd9\x8a", "F":"\xd9\x8b", "N":"\xd9\x8c", "K":"\xd9\x8d", "~":"\xd9\x91", "o":"\xd9\x92", "u":"\xd9\x8f", "a":"\xd9\x8e", "i":"\xd9\x90"}
>>> ordbuckArab = {ord(v.decode('utf8')): unicode(k) for (k, v) in buckArab.iteritems()}
>>> ordbuckArab
{1569: u"'", 1570: u'|', 1571: u'?', 1572: u'&', 1573: u'<', 1574: u'}', 1575: u'A', 1576: u'b', 1577: u'p', 1578: u't', 1579: u'v', 1580: u'g', 1581: u'H', 1582: u'x', 1583: u'd', 1584: u'*', 1585: u'r', 1586: u'z', 1587: u's', 1588: u'$', 1589: u'S', 1590: u'D', 1591: u'T', 1592: u'Z', 1593: u'E', 1594: u'G', 1600: u'_', 1601: u'f', 1602: u'q', 1603: u'k', 1604: u'l', 1605: u'm', 1606: u'n', 1607: u'h', 1608: u'w', 1609: u'Y', 1610: u'y', 1611: u'F', 1612: u'N', 1613: u'K', 1614: u'a', 1615: u'u', 1616: u'i', 1617: u'~', 1618: u'o'}
>>> u'\u0637\u0639\u0635\u0637'.translate(ordbuckArab)
u'TEST'
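
In Python 3 every `str` is already Unicode, so the same ordinal-keyed table can probably be built without `decode` or `unicode`. A minimal sketch, assuming a three-entry subset of the table (the names `buck_arab` and `ord_buck_arab` are illustrative):

```python
# Hypothetical three-entry subset: Buckwalter symbol -> Arabic letter.
buck_arab = {"T": "\u0637", "E": "\u0639", "S": "\u0635"}

# Key the table by the code point of each Arabic letter, which is
# what str.translate uses for its lookups.
ord_buck_arab = {ord(v): k for k, v in buck_arab.items()}

# Arabic -> Buckwalter in a single pass.
result = "\u0637\u0639\u0635\u0637".translate(ord_buck_arab)

# Characters without an entry (here a space and a digit) pass through as-is.
mixed = "\u0637 1".translate(ord_buck_arab)
```

Text that mixes Arabic with digits or Latin characters therefore survives the conversion, with only the mapped letters replaced.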