Python - pyparsing unicode characters

Question

bod*_*tva 12 python unicode nlp pyparsing

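I tried using w = Word(printables), but it does not work. How should I write this specification? 'w' is meant to match Hindi characters (UTF-8).

The code specifies a grammar and parses the input accordingly.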

671.assess  :: ?????  ::2
x=number + "." + src + "::" + w + "::" + number + "." + number

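If the input contains only English characters, the code works correctly for ASCII, but it does not work for Unicode input.

I mean that the code works when the input has the form 671.assess :: ahsaas :: 2

That is, it parses words written in English (ASCII) form, but I don't know how to parse and then print characters in Unicode form. I need this for English-Hindi word alignment.

The Python code is shown below: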

# -*- coding: utf-8 -*-
from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit
# grammar 
src = Word(printables)
trans =  Word(printables)
number = Word(nums)
x=number + "." + src + "::" + trans + "::" + number + "." + number
#parsing for eng-dict
efiledata = open('b1aop_or_not_word.txt').read()
eresults = x.scanString(efiledata)  # scanString yields (tokens, start, end) per match, which the loop below indexes
edict1 = {}
edict2 = {}
counter=0
xx=list()
for result in eresults:
  trans=""#translation string
  ew=""#english word
  xx=result[0]
  ew=xx[2]
  trans=xx[4]   
  edict1 = { ew:trans }
  edict2.update(edict1)
print len(edict2) #no of entries in the english dictionary
print "edict2 has been created"
print "english dictionary" , edict2 

#parsing for hin-dict
hfiledata = open('b1aop_or_not_word.txt').read()
hresults = x.scanString(hfiledata)
hdict1 = {}
hdict2 = {}
counter=0
for result in hresults:
  trans=""#translation string
  hw=""#hin word
  xx=result[0]  
  hw=xx[2]
  trans=xx[4]
  #print trans
  hdict1 = { trans:hw }
  hdict2.update(hdict1)

print len(hdict2) #no of entries in the hindi dictionary
print"hdict2 has been created"
print "hindi dictionary" , hdict2
'''
#######################################################################################################################

def translate(d, ow, hinlist):
    if ow in d.keys():#ow=old word d=dict
        print ow , "exists in the dictionary keys"
        transes = d[ow]
        transes = transes.split()
        print "possible transes for" , ow , " = ", transes
        for word in transes:
            if word in hinlist:
                print "trans for" , ow , " = ", word
                return word
        return None
    else:
        print ow , "absent"
        return None

f = open('bidir','w')
#lines = ["'\
#5# 10 # and better performance in business in turn benefits consumers .  # 0 0 0 0 0 0 0 0 0 0 \
#5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI .  # 0 0 0 0 0 0 0 0 0 0 0 \
#'"]
data=open('bi_full_2','rb').read()
lines = data.split('!@#$%')
loc=0
for line in lines:
    eng, hin = [subline.split(' # ')
                for subline in line.strip('\n').split('\n')]

    for transdict, source, dest in [(edict2, eng, hin),
                                    (hdict2, hin, eng)]:
        sourcethings = source[2].split()
        for word in source[1].split():
            tl = dest[1].split()
            otherword = translate(transdict, word, tl)
            loc = source[1].split().index(word)
            if otherword is not None:
                otherword = otherword.strip()
                print word, ' <-> ', otherword, 'meaning=good'
                if otherword in dest[1].split():
                    print word, ' <-> ', otherword, 'trans=good'
                    sourcethings[loc] = str(
                        dest[1].split().index(otherword) + 1)

        source[2] = ' '.join(sourcethings)

    eng = ' # '.join(eng)
    hin = ' # '.join(hin)
    f.write(eng+'\n'+hin+'\n\n\n')
f.close()
'''

If a sample input sentence from the source file is:

1# 5 # modern markets : confident consumers  # 0 0 0 0 0 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 0 0 0 0 0 0 
!@#$%

The output looks like this:

1# 5 # modern markets : confident consumers  # 1 2 3 4 5 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 1 2 3 4 5 0 
!@#$%

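Explanation of the output: this achieves bidirectional alignment. The first word of the English sentence, "modern", maps to the first word of the Hindi sentence, "AddhUnIk", and vice versa. Punctuation characters are also treated as words here, because they are part of the bidirectional mapping as well. So, if you look closely, the Hindi word '.' has a null alignment: it maps to nothing in the English sentence, because the English sentence has no full stop. The third line of the output is just a delimiter, which is needed because we work on many sentences for which we are trying to achieve the bidirectional mapping.

What modifications should I make for this to work if the Hindi sentences are in Unicode (UTF-8) format?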

Answer 1

Pau*_*McG 27

Pyparsing's printables only covers characters in the ASCII range. You want printables across the full Unicode range, like this:

import sys

unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                                        if not unichr(c).isspace())

Now you can define trans using this more complete set of non-whitespace characters:

trans = Word(unicodePrintables)

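I can't test this with your Hindi test string, but I think it will do the trick.

If you are using Python 3, there is no separate unichr function and no xrange generator; just use: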

unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode) 
                                        if not chr(c).isspace())
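
As a rough illustration of how this fits into the question's grammar, here is a minimal Python 3 sketch. Two things in it are assumptions rather than part of the original post: the Hindi word in the sample line is a stand-in (the sample in the question did not survive extraction), and the grammar is trimmed to the "N.word :: translation ::N" shape of that sample, without the trailing "." + number.

```python
import sys
from pyparsing import Word, nums, printables

# Every non-whitespace code point, built as in the Python 3 snippet above.
unicode_printables = ''.join(chr(c) for c in range(sys.maxunicode)
                             if not chr(c).isspace())

number = Word(nums)
src = Word(printables)            # English word, ASCII only
trans = Word(unicode_printables)  # any run of non-whitespace characters

# Trimmed version of the question's grammar (assumption: no trailing "." + number).
entry = number + "." + src + "::" + trans + "::" + number

# Hypothetical sample line; the Hindi word is an assumed stand-in.
sample = "671.assess  :: अहसास  ::2"

tokens = entry.parseString(sample)
english_word = tokens[2]
hindi_word = tokens[4]
```

The token positions 2 and 4 are the same ones the question's loop reads (xx[2] and xx[4]), so the rest of that script should not need to change for this part.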

EDIT:

With the recent release of pyparsing 2.3.0, new namespace classes have been defined that provide printables, alphas, nums, and alphanums for various Unicode ranges.

import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
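
A minimal sketch of how the Devanagari set might be used to build the two dictionaries from the question, assuming pyparsing 2.3.0 or later. The data lines, the dictionary names eng_to_hin and hin_to_eng, and the trimmed grammar (no trailing "." + number) are hypothetical and chosen only for illustration.

```python
import pyparsing as pp

number = pp.Word(pp.nums)
src = pp.Word(pp.printables)
# Limit the translation field to non-whitespace Devanagari characters.
trans = pp.Word(pp.pyparsing_unicode.Devanagari.printables)

# Trimmed version of the question's grammar (assumption).
entry = number + "." + src + "::" + trans + "::" + number

# Hypothetical dictionary data in the "N.word :: translation ::N" layout.
data = """671.assess  :: अहसास  ::2
672.market  :: बाजार  ::1
"""

eng_to_hin = {}
hin_to_eng = {}
# scanString yields a (tokens, start, end) tuple for each match in the text.
for tokens, start, end in entry.scanString(data):
    eng_to_hin[tokens[2]] = tokens[4]
    hin_to_eng[tokens[4]] = tokens[2]

print(len(eng_to_hin))  # number of entries, as the question's script reports
```

Restricting trans to one script makes the grammar stricter than the full-range set: a translation field containing ASCII or another script would no longer match, which may or may not be what you want.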

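@flyingsheep - good tip, updated to use `sys.maxunicode` instead of a hard-coded constant, so it will track Python's `sys` module. As for looping over everything, this bit runs only once, when the parser is first defined, and when it is used to create a pyparsing `Word` it is stored as a set(), so parse-time performance should still be good. (2 upvotes)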

Answer 2

Ale*_*lli 7

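As a general rule, do not process encoded bytestrings: turn them into proper unicode strings (by calling their .decode method) as soon as possible, always do your processing on unicode strings, and then, if you have to for I/O purposes, .encode them back into whatever bytestring encoding you require.

If you're talking about literals, as it seems you are in your code, "as soon as possible" means at once: use u'...' to express your literals. In the more general case, where you're forced to do I/O in encoded form, it means immediately after input (just as, when you need to produce output in a specific encoded form, you encode immediately before output).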
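
The decode-early, encode-late pattern can be sketched as follows in Python 3. The sample bytes and the simple split on "::" are assumptions made for illustration; they stand in for the file contents and for the pyparsing grammar.

```python
import io

# Bytes as they would arrive from a file opened in binary mode (hypothetical sample).
raw = "671.assess  :: अहसास  ::2\n".encode("utf-8")

# Decode once, immediately after input.
text = raw.decode("utf-8")

# Do all processing on the decoded text string.
fields = [field.strip() for field in text.split("::")]
hindi_word = fields[1]

# Encode once, immediately before output.
out = io.BytesIO()
out.write((hindi_word + "\n").encode("utf-8"))
```

When the input comes from a file, io.open(path, encoding="utf-8") performs the decode step during reading, so read() already returns a text string.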
