Pyparsing: Parsing semi-JSON nested plaintext data to a list

Question

Jos*_*ker 3 python parsing json plaintext pyparsing

I have a bunch of nested data in a format that loosely resembles JSON:

company="My Company"
phone="555-5555"
people=
{
    person=
    {
        name="Bob"
        location="Seattle"
        settings=
        {
            size=1
            color="red"
        }
    }
    person=
    {
        name="Joe"
        location="Seattle"
        settings=
        {
            size=2
            color="blue"
        }
    }
}
places=
{
    ...
}

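There are many different parameters at different levels of depth; this is just a very small subset.

It may be worth noting that whenever a new sub-array is created, there is always an equals sign followed by a line break (as shown above).

Is there a simple looping or recursion technique for converting this data to a system-friendly format such as arrays or JSON? I want to avoid hard-coding the names of properties. I am looking for something that will work in Python, Java, or PHP. Pseudocode is fine, too.

I appreciate any help.

EDIT: I discovered the Pyparsing library for Python, and it looks like it could be a big help. I can't find any examples of how to use Pyparsing to parse nested structures of unknown depth. Can anyone shed light on Pyparsing in terms of the data I described above?

EDIT 2: Okay, here is a working solution in Pyparsing: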

from pyparsing import (Dict, Forward, Group, OneOrMore, Regex, Suppress, Word,
                       ZeroOrMore, alphanums, alphas, quotedString, removeQuotes)

def parse_file(fileName):
    #get the input text file
    with open(fileName, "r") as file:
        inputText = file.read()

    #define the elements of our data pattern
    name = Word(alphas, alphanums+"_")
    EQ,LBRACE,RBRACE = map(Suppress, "={}")
    value = Forward() #this tells pyparsing that values can be recursive
    entry = Group(name + EQ + value) #this is the basic name-value pair

    #define data types that might be in the values
    real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
    integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
    quotedString.setParseAction(removeQuotes)

    #declare the overall structure of a nested data element
    struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE) #we will turn the output into a Dictionary

    #declare the types that might be contained in our data value - string, real, int, or the struct we declared
    value << (quotedString | struct | real | integer)

    #parse our input text
    result = Dict(OneOrMore(entry)).parseString(inputText)

    #dump() returns a formatted string describing the results, not a dict
    return result.dump()

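This works, but when I try to write the results to a file with json.dump(result), the contents of the file are wrapped in double quotes. Also, there are \n characters between many of the data pairs. I tried suppressing them in the code above with LineEnd().suppress(), but I must not be using it correctly.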
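A likely explanation, assuming the value handed to json was the string that parse_file returns from result.dump(): serializing a str always yields a single quoted JSON string with its newlines escaped as \n, so the parser itself may not be at fault. The sketch below uses only the standard library; the dumped_text value is a made-up stand-in for real dump() output.

```python
import json

# Hypothetical stand-in for the multi-line text returned by dump().
dumped_text = "- company: My Company\n- phone: 555-5555"

# A str is serialized as one JSON string: wrapped in double quotes,
# with each newline written as the two characters backslash and n.
as_json_string = json.dumps(dumped_text)

# A dict is serialized as a JSON object, which is usually what is wanted.
as_json_object = json.dumps({"company": "My Company", "phone": "555-5555"})

print(as_json_string)
print(as_json_object)
```

If that is the cause, the fix is to serialize a dict or list built from the parse results rather than the dump() text.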

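Okay, I came up with a final solution that actually transforms this data into the JSON-friendly dict I originally wanted. It first uses Pyparsing to convert the data into a series of nested lists, then loops through the lists and converts them to JSON. This let me work around the fact that Pyparsing's toDict() method cannot handle an object that has two properties with the same name. To tell a plain list apart from a property/value pair, the prependPropertyToken method prefixes each property name with the string __property__ when Pyparsing detects it.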

    def parse_file(self,fileName):

            #get the input text file
            with open(fileName, "r") as file:
                    inputText = file.read()


            #define data types that might be in the values
            real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
            integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
            yes = CaselessKeyword("yes").setParseAction(replaceWith(True))
            no = CaselessKeyword("no").setParseAction(replaceWith(False))
            quotedString.setParseAction(removeQuotes)
            unquotedString =  Word(alphanums+"_-?\"")
            comment = Suppress("#") + Suppress(restOfLine)
            EQ,LBRACE,RBRACE = map(Suppress, "={}")

            data = (real | integer | yes | no | quotedString | unquotedString)

            #define structures
            value = Forward()
            object = Forward() 

            dataList = Group(OneOrMore(data))
            simpleArray = (LBRACE + dataList + RBRACE)

            propertyName = Word(alphanums+"_-.").setParseAction(self.prependPropertyToken)
            property = dictOf(propertyName + EQ, value)
            properties = Dict(property)

            object << (LBRACE + properties + RBRACE)
            value << (data | object | simpleArray)

            dataset = properties.ignore(comment)

            #parse it
            result = dataset.parseString(inputText)

            #turn it into a JSON-like object
            dict = self.convert_to_dict(result.asList())
            return json.dumps(dict)



    def convert_to_dict(self, inputList):
            dict = {}
            for item in inputList:
                    #determine the key and value to be inserted into the dict
                    dictval = None
                    key = None

                    if isinstance(item, list):
                            try:
                                    key = item[0].replace("__property__","")
                                    if isinstance(item[1], list):
                                            try:
                                                    if item[1][0].startswith("__property__"):
                                                            dictval = self.convert_to_dict(item)
                                                    else:
                                                            dictval = item[1]
                                            except AttributeError:
                                                    dictval = item[1]
                                    else:
                                            dictval = item[1]
                            except IndexError:
                                    dictval = None
                    #determine whether to insert the value into the key or to merge the value with existing values at this key
                    if key:
                            if key in dict:
                                    if isinstance(dict[key], list):
                                            dict[key].append(dictval)
                                    else:
                                            old = dict[key]
                                            new = [old]
                                            new.append(dictval)
                                            dict[key] = new
                            else:
                                    dict[key] = dictval
            return dict



    def prependPropertyToken(self,t):
            return "__property__" + t[0]
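The merging step in convert_to_dict can be looked at in isolation. The following is a minimal standard-library sketch, not the code above: it assumes the input is already a nested list of [name, value] pairs, and it replaces the __property__ marker with a simple shape check (is_pair_list, a hypothetical helper). That check would misread a plain data list that happens to consist of two-item lists starting with a string, so treat it as an illustration only.

```python
import json

def is_pair_list(value):
    """Heuristic (an assumption of this sketch): a list whose items all look like [name, value]."""
    return isinstance(value, list) and all(
        isinstance(item, list) and len(item) == 2 and isinstance(item[0], str)
        for item in value
    )

def pairs_to_dict(pairs):
    """Convert [name, value] pairs to a dict, collecting repeated names into a list."""
    out = {}
    repeated = set()  # names that have already been turned into a list of values
    for name, value in pairs:
        if is_pair_list(value):
            value = pairs_to_dict(value)  # recurse into nested structures
        if name not in out:
            out[name] = value
        elif name in repeated:
            out[name].append(value)
        else:
            out[name] = [out[name], value]
            repeated.add(name)
    return out

sample = [
    ["company", "My Company"],
    ["people", [
        ["person", [["name", "Bob"], ["settings", [["size", 1]]]]],
        ["person", [["name", "Joe"], ["settings", [["size", 2]]]]],
    ]],
]

converted = pairs_to_dict(sample)
print(json.dumps(converted, indent=2))
```

Tracking repeated names in a separate set avoids confusing a value that is itself a list with a list created by merging duplicates.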

Answer 1

Pau*_*McG 5

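Parsing an arbitrarily nested structure can be done with pyparsing by using the Forward class to define a placeholder for the nested part. In this case, you are just parsing simple name-value pairs, where the value can itself be a nested structure containing name-value pairs.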

name :: word of alphanumeric characters
entry :: name '=' value
struct :: '{' entry* '}'
value :: real | integer | quotedstring | struct

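This translates almost verbatim to pyparsing. To define a value that can recursively contain values, we first create a Forward() placeholder, which can be used as part of the definition of entry. Then, once we have defined all the possible types of values, we use the '<<' operator to insert this definition into the value expression: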

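from pyparsing import (Dict, Forward, Group, OneOrMore, Regex, Suppress, Word,
                       ZeroOrMore, alphanums, alphas, quotedString, removeQuotes)

EQ,LBRACE,RBRACE = map(Suppress,"={}")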

name = Word(alphas, alphanums+"_")
value = Forward()
entry = Group(name + EQ + value)

real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
quotedString.setParseAction(removeQuotes)

struct = Group(LBRACE + ZeroOrMore(entry) + RBRACE)
value << (quotedString | struct | real | integer)

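The parse actions on real and integer convert these elements from strings to floats or ints at parse time, so the values can be used as their actual types immediately after parsing (no post-processing is needed for string-to-other-type conversion).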
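One detail worth spelling out: alternatives joined with '|' are tried in order and the first one that matches wins, which is why real is listed before integer in the value definition. The sketch below imitates that behavior with the standard re module only; convert_prefix and the two pattern lists are hypothetical names for illustration, not pyparsing API.

```python
import re

REAL = re.compile(r"[+-]?\d+\.\d*")
INTEGER = re.compile(r"[+-]?\d+")

# Each alternative pairs a pattern with the conversion applied on a match.
REAL_FIRST = [(REAL, float), (INTEGER, int)]
INTEGER_FIRST = [(INTEGER, int), (REAL, float)]

def convert_prefix(text, alternatives):
    """Try each alternative in order at the start of text; the first match wins.

    Returns (converted value, unconsumed remainder), or (None, text) if nothing matches.
    """
    for pattern, convert in alternatives:
        match = pattern.match(text)
        if match:
            return convert(match.group()), text[match.end():]
    return None, text

print(convert_prefix("2.5", REAL_FIRST))     # the whole number is consumed as a float
print(convert_prefix("2.5", INTEGER_FIRST))  # only "2" is consumed, ".5" is left over
print(convert_prefix("7", REAL_FIRST))       # no decimal point, so the integer pattern matches
```

With the integer pattern first, the leftover ".5" would make the rest of the parse fail, so the more specific pattern goes first.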

Your sample is a collection of one or more entries, so we use that to parse the total input:

result = OneOrMore(entry).parseString(sample)

We can access the parsed data as a nested list, but it is not very pretty to display. This code uses pprint to pretty-print the nested list:

from pprint import pprint
pprint(result.asList())

Giving:

[['company', 'My Company'],
 ['phone', '555-5555'],
 ['people',
  [['person',
    [['name', 'Bob'],
     ['location', 'Seattle'],
     ['settings', [['size', 1], ['color', 'red']]]]],
   ['person',
    [['name', 'Joe'],
     ['location', 'Seattle'],
     ['settings', [['size', 2], ['color', 'blue']]]]]]]]

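Notice that all the strings are just strings with no enclosing quotation marks, and the ints are actual ints.

We can do a little better than this by recognizing that the entry format actually defines a name-value pair suitable for accessing like a Python dict. Our parser can do this with just a few minor changes:

Change the struct definition to: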

struct = Dict(LBRACE + ZeroOrMore(entry) + RBRACE)

and the overall parser to:

result = Dict(OneOrMore(entry)).parseString(sample)

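The Dict class treats the parsed contents as a name followed by a value, and this can be done recursively. With these changes, we can now access the data in result like elements in a dict: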

print(result['phone'])

or like attributes of an object:

print(result.company)

Use the dump() method to view the contents of a structure or substructure:

for person in result.people:
    print(person.dump())
    print()

Prints:

['person', ['name', 'Bob'], ['location', 'Seattle'], ['settings', ['size', 1], ['color', 'red']]]
- location: Seattle
- name: Bob
- settings: [['size', 1], ['color', 'red']]
  - color: red
  - size: 1

['person', ['name', 'Joe'], ['location', 'Seattle'], ['settings', ['size', 2], ['color', 'blue']]]
- location: Seattle
- name: Joe
- settings: [['size', 2], ['color', 'blue']]
  - color: blue
  - size: 2
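For comparison, and since the question also asked for a general recursive technique, the grammar at the top of this answer can be handled without pyparsing by a small hand-written recursive-descent parser. This is only a sketch under stated assumptions: the function names (tokenize, parse_entries, parse_value, parse) are made up for illustration, quoted strings are assumed to contain no escaped quotes, and elided input such as "..." is not supported. The mutual recursion between parse_value and parse_entries plays the role that Forward() plays in pyparsing.

```python
import re

# One token per match: quoted string, real, integer, name, or one of = { }
TOKEN_RE = re.compile(r'''
    \s*(?:
        "(?P<string>[^"]*)"
      | (?P<real>[+-]?\d+\.\d*)
      | (?P<integer>[+-]?\d+)
      | (?P<name>[A-Za-z][A-Za-z0-9_]*)
      | (?P<punct>[={}])
    )''', re.VERBOSE)

def tokenize(text):
    """Split text into (kind, value) tokens, converting numbers as they are read."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at offset {pos}")
        kind = match.lastgroup
        raw = match.group(kind)
        if kind == "real":
            raw = float(raw)
        elif kind == "integer":
            raw = int(raw)
        tokens.append((kind, raw))
        pos = match.end()
    return tokens

def parse_entries(tokens, pos):
    """entry* : read name '=' value pairs until '}' or end of input."""
    entries = []
    while pos < len(tokens) and tokens[pos] != ("punct", "}"):
        kind, name = tokens[pos]
        if kind != "name" or pos + 1 >= len(tokens) or tokens[pos + 1] != ("punct", "="):
            raise ValueError(f"expected name '=' at token {pos}")
        value, pos = parse_value(tokens, pos + 2)
        entries.append([name, value])
    return entries, pos

def parse_value(tokens, pos):
    """value : string | real | integer | struct, where struct is '{' entry* '}'."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    kind, value = tokens[pos]
    if (kind, value) == ("punct", "{"):
        entries, pos = parse_entries(tokens, pos + 1)  # recursion handles any depth
        if pos >= len(tokens):
            raise ValueError("missing '}'")
        return entries, pos + 1  # step over the closing brace
    if kind in ("string", "real", "integer"):
        return value, pos + 1
    raise ValueError(f"unexpected token {value!r}")

def parse(text):
    tokens = tokenize(text)
    entries, pos = parse_entries(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected '}'")
    return entries

sample = '''
company="My Company"
people=
{
    person=
    {
        name="Bob"
        settings=
        {
            size=1
            color="red"
        }
    }
}
'''

parsed = parse(sample)
print(parsed)
```

The result is a nested list of [name, value] pairs in the same shape as the asList() output shown earlier, so repeated names such as person are preserved rather than overwritten.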

Archived: 12 years ago
Views: 1601
Last activity: 11 years, 11 months ago