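I only started using pyparsing this evening, and I have already built a complex grammar that describes some sources I'm working with very effectively. It was very easy and very powerful. However, I'm having some trouble working with ParseResults. I need to be able to iterate over nested tokens in the order in which they were found, and I'm finding it a little frustrating. I've abstracted my problem to a simple case: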
import pyparsing as pp
word = pp.Word(pp.alphas + ',.')('word*')
direct_speech = pp.Suppress('“') + pp.Group(pp.OneOrMore(word))('direct_speech*') + pp.Suppress('”')
sentence = pp.Group(pp.OneOrMore(word | direct_speech))('sentence')
test_string = 'Lorem ipsum “dolor sit” amet, consectetur.'
r = sentence.parseString(test_string)
print r.asXML('div')
print ''
for name, item in r.sentence.items():
    print name, item
print ''
for item in r.sentence:
    print item.getName(), item.asList()
As far as I can see, this ought to work? Here is the output:
<div>
  <sentence>
    <word>Lorem</word>
    <word>ipsum</word>
    <direct_speech>
      <word>dolor</word>
      <word>sit</word>
    </direct_speech>
    <word>amet,</word>
    <word>consectetur.</word>
  </sentence>
</div>
word ['Lorem', 'ipsum', 'amet,', 'consectetur.']
direct_speech [['dolor', 'sit']]
Traceback (most recent call last):
  File "./test.py", line 27, in <module>
    print item.getName(), item.asList()
AttributeError: 'str' object has no attribute 'getName'
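The XML output seems to show that the string is parsed exactly as I want, but I can't iterate over the sentence in order to, say, reconstruct it.
Is there a way to do what I need?
Thanks!
EDIT:
I've been using this: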
for item in r.sentence:
    if isinstance(item, basestring):
        print item
    else:
        print item.getName(), item
But it doesn't help me much, because I can't tell the different kinds of string apart. Here is a slightly expanded example:
word = pp.Word(pp.alphas + ',.')('word*')
number = pp.Word(pp.nums + ',.')('number*')
direct_speech = pp.Suppress('“') + pp.Group(pp.OneOrMore(word | number))('direct_speech*') + pp.Suppress('”')
sentence = pp.Group(pp.OneOrMore(word | number | direct_speech))('sentence')
test_string = 'Lorem 14 ipsum “dolor 22 sit” amet, consectetur.'
r = sentence.parseString(test_string)
for i, item in enumerate(r.sentence):
    if isinstance(item, basestring):
        print i, item
    else:
        print i, item.getName(), item
The output is:
0 Lorem
1 14
2 ipsum
3 word ['dolor', '22', 'sit']
4 amet,
5 consectetur.
Not very helpful. I can't tell a word from a number, and the direct_speech element is labelled word?!
I'm obviously missing something. All I want to do is:
for item in r.sentence:
    if (item is a number):
        do something
    elif (item is a word):
        do something else
    etc. ...
Should I be approaching this differently?
r.sentence contains a mix of strings and ParseResults, and only ParseResults support getName(). Have you tried just iterating over r.sentence? If I print it out using asList(), I get:
['Lorem', 'ipsum', ['dolor', 'sit'], 'amet,', 'consectetur.']
Or this snippet:
for item in r.sentence:
    print type(item),item.asList() if isinstance(item,pp.ParseResults) else item
gives:
<type 'str'> Lorem
<type 'str'> ipsum
<class 'pyparsing.ParseResults'> ['dolor', 'sit']
<type 'str'> amet,
<type 'str'> consectetur.
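Because every item is either a plain string or a nested group, a short recursive walk is enough to rebuild the sentence in its original order. The following is only a sketch, written in Python 3 syntax: `rebuild` is a hypothetical helper name, it operates on the plain list that asList() returns, and it assumes that every nested list is direct speech, which holds for this small grammar but not in general.

```python
def rebuild(items):
    """Join tokens in order, re-wrapping nested lists in curly quotes."""
    parts = []
    for item in items:
        if isinstance(item, str):
            # A plain token: keep it as it was matched.
            parts.append(item)
        else:
            # A nested list: assumed here to be a direct_speech group.
            parts.append('“' + rebuild(item) + '”')
    return ' '.join(parts)


# The nested list printed by asList() above.
nested = ['Lorem', 'ipsum', ['dolor', 'sit'], 'amet,', 'consectetur.']
print(rebuild(nested))
```

This recovers the order of the tokens, but it still cannot tell one kind of string from another.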
I'm not sure whether I've answered your question, but does this shed any light on what to do next?
(Welcome to Pyparsing)
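One possible next step, offered as a sketch rather than as the recommended pyparsing idiom, is to stop relying on results names for the leaf tokens and attach a parse action that replaces each match with a (kind, value) tuple. The kind then travels with the token, so an in-order loop can branch on it. The `tag` helper and the tuple layout are assumptions made for illustration; only Word, Suppress, OneOrMore, setParseAction and parseString come from pyparsing. The code uses Python 3 syntax.

```python
import pyparsing as pp


def tag(kind):
    """Return a parse action that swaps the match for a (kind, text) tuple."""
    def action(tokens):
        return [(kind, tokens[0])]
    return action


word = pp.Word(pp.alphas + ',.').setParseAction(tag('word'))
number = pp.Word(pp.nums + ',.').setParseAction(tag('number'))

# The quoted part collects its already-tagged children into one tuple.
direct_speech = (
    pp.Suppress('“') + pp.OneOrMore(word | number) + pp.Suppress('”')
).setParseAction(lambda tokens: [('direct_speech', list(tokens))])

sentence = pp.OneOrMore(word | number | direct_speech)

test_string = 'Lorem 14 ipsum “dolor 22 sit” amet, consectetur.'
items = list(sentence.parseString(test_string))

for kind, value in items:
    if kind == 'number':
        print('number:', value)
    elif kind == 'word':
        print('word:', value)
    else:
        # direct_speech: value is a list of (kind, text) tuples
        print(kind, value)
```

The leaf expressions are deliberately left unnamed here, since combining a results name with a parse action that returns a tuple may not store what you expect under that name.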
OK, I've now tried a number of different approaches but couldn't get what I need, so (ridiculous as it seems) I'm using .asXML() and parsing the resulting XML. Here is my example:
import pyparsing as pp

word = pp.Word(pp.alphas + ',.')('word*')
number = pp.Word(pp.nums + ',.')('number*')
direct_speech = pp.Suppress('“') + pp.Group(pp.OneOrMore(word | number))('direct_speech*') + pp.Suppress('”')
sentence = pp.Group(pp.OneOrMore(word | number | direct_speech))('sentence')

test_string = 'Lorem 14 ipsum “dolor 22 sit” amet, consectetur.'
r = sentence.parseString(test_string)

from lxml import etree
xml = etree.fromstring(r.sentence.asXML('sentence'))
for el in xml:
    if len(el):
        print el.tag
        for sub_el in el:
            print ' ', sub_el.tag, ':', sub_el.text
    else:
        print el.tag, ':', el.text
Its output:
word : Lorem
number : 14
word : ipsum
direct_speech
  word : dolor
  number : 22
  word : sit
word : amet,
word : consectetur.
It seems like going all the way around the houses, but there doesn't appear to be a better way.
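If lxml is not available, the same walk should also be possible with the standard library's xml.etree.ElementTree. The sketch below uses Python 3 syntax and makes two assumptions: the XML literal is written by hand to mirror the structure printed above rather than captured from asXML(), and `flatten` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the string returned by r.sentence.asXML('sentence').
xml_text = """<sentence>
  <word>Lorem</word>
  <number>14</number>
  <word>ipsum</word>
  <direct_speech>
    <word>dolor</word>
    <number>22</number>
    <word>sit</word>
  </direct_speech>
  <word>amet,</word>
  <word>consectetur.</word>
</sentence>"""


def flatten(root):
    """Return (tag, text) pairs in document order, nesting child elements."""
    rows = []
    for el in root:
        if len(el):
            # Element with children, e.g. direct_speech.
            rows.append((el.tag, [(sub.tag, sub.text) for sub in el]))
        else:
            rows.append((el.tag, el.text))
    return rows


rows = flatten(ET.fromstring(xml_text))
for tag, value in rows:
    print(tag, ':', value)
```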