How do I automatically fix an invalid JSON string?

Ant*_*ski 18 python json escaping

I get the following JSON string from the 2gis API.

{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva,   172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center "ADVANCE"",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}

But Python does not accept it:

json.loads(firm_str)

Expecting , delimiter: line 1 column 3646 (char 3645)

It looks like a problem with the quotes in: "title": "Center "ADVANCE""

How can I fix it automatically in Python?

tob*_*s_k 28

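@Michael's answer gave me an idea... not a very pretty one, but it seems to work, at least on your example: try to parse the JSON string, and if that fails, read the position of the offending character from the exception string (note 1) and replace that character.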

import json
import re

while True:
    try:
        result = json.loads(s)   # try to parse...
        break                    # parsing worked -> exit loop
    except ValueError as e:
        # "Expecting , delimiter: line 34 column 54 (char 1158)"
        # position of unexpected character after '"'
        unexp = int(re.findall(r'\(char (\d+)\)', str(e))[0])
        # position of unescaped '"' before that
        unesc = s.rfind('"', 0, unexp)
        s = s[:unesc] + r'\"' + s[unesc+1:]
        # position of corresponding closing '"' (+2 for inserted '\')
        closg = s.find('"', unesc + 2)
        s = s[:closg] + r'\"' + s[closg+1:]
print(result)

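You may want to add some extra checks to keep this from ending in an infinite loop (e.g., allow at most as many iterations as there are characters in the string). Also, as @gnibbler pointed out, this still does not work if the incorrect " is actually followed by a comma.

Update: This now seems to work quite well (though still not perfectly), even if the unescaped " is followed by a comma or a closing bracket, because in that case the parser will most likely complain about a syntax error somewhere after it (expected property name, etc.) and the code traces back to the last ". It also automatically escapes the corresponding closing " (assuming there is one).


1) The exception's str is "Expecting , delimiter: line XXX column YYY (char ZZZ)", where ZZZ is the position in the string where the error occurred. Note, though, that this message may depend on the version of Python, the json module, the OS, or the locale, so this solution may have to be adapted accordingly.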
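
A minimal sketch of the same idea that does not depend on the wording of the message, assuming Python 3.5 or later, where `json.JSONDecodeError` exposes the failure index as `e.pos`. The function name `repair_quotes` and the `max_rounds` parameter are made up for illustration; the bound follows the "as many iterations as characters" suggestion above.

```python
import json

def repair_quotes(s, max_rounds=None):
    """Escape stray double quotes until json.loads accepts the string."""
    if max_rounds is None:
        max_rounds = len(s)  # upper bound so the loop always terminates
    for _ in range(max_rounds):
        try:
            return json.loads(s)
        except json.JSONDecodeError as e:
            # e.pos is the index where parsing failed (the "char ZZZ" value).
            unesc = s.rfind('"', 0, e.pos)
            if unesc == -1:
                raise  # no quote before the error position: give up
            s = s[:unesc] + '\\"' + s[unesc + 1:]
            # Escape the matching closing quote too, if there is one
            # (+2 skips the backslash and quote just written).
            closg = s.find('"', unesc + 2)
            if closg != -1:
                s = s[:closg] + '\\"' + s[closg + 1:]
    raise ValueError("could not repair JSON within %d rounds" % max_rounds)

bad = '{"title": "Center "ADVANCE"", "text": "Business.English."}'
print(repair_quotes(bad))
```

This is still a heuristic: it is only checked here against the doubled-quote case from the question, and other kinds of damage may be "repaired" into something that parses but is not what the sender meant.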


ato*_*757 7

If this is exactly what the API returns, then there is a problem with their API. This is invalid JSON, specifically in this area:

"ads": {
            "sponsored_article": {
                "title": "??????????????? ????? "ADVANCE"", <-- here
                "text": "??????.????????.?????????? ????.?????????? ? ?????.?????????? ? ???."
            },
            "warning": null
        }

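The double quotes around ADVANCE are not escaped. You can verify this with something like http://jsonlint.com/.

The problem is that the " characters are not escaped. If this is what you are receiving, the data is bad at the source, and they need to fix it.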

Parse error on line 4:
...???????????? ????? "ADVANCE"",         
-----------------------^
Expecting '}', ':', ',', ']'

This fixes the problem:

"title": "??????????????? ????? \"ADVANCE\"",
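
For comparison, a standard JSON serializer produces this escaping on its own. A small sketch with Python's `json` module; the `record` dictionary is made up for illustration and is not the actual 2gis payload.

```python
import json

# A value that contains double quotes, as in the "title" field above.
record = {"title": 'Center "ADVANCE"'}

encoded = json.dumps(record)
print(encoded)  # {"title": "Center \"ADVANCE\""}

# The escaped form parses back to the original value.
assert json.loads(encoded) == record
```

Output like the one in the question usually means the JSON was assembled by string concatenation rather than by a serializer.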

  • You should first write to `2gis` and tell them they are not returning valid JSON. (6 upvotes)

Pao*_*olo 6

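The only real and definitive solution is to ask 2gis to fix their API.

In the meantime, you can repair the badly encoded JSON by escaping the double quotes inside its strings. If every key-value pair is followed by a newline (which appears to be the case in the posted data), the following function will do the job: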

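def fixjson(badjson):
    s = badjson
    idx = 0
    while True:
        try:
            # A string value starts right after the sequence '": "'
            start = s.index('": "', idx) + 4
        except ValueError:
            # No further string values: done
            return s
        # ... and ends before '",\n' or '"\n', whichever comes first
        ends = [e for e in (s.find('",\n', start), s.find('"\n', start)) if e != -1]
        if not ends:
            return s
        end = min(ends)
        # Escape every double quote inside the value
        content = s[start:end].replace('"', '\\"')
        s = s[:start] + content + s[end:]
        # Continue after the closing quote of this value
        idx = start + len(content) + 1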

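Note the assumptions being made:

The function tries to escape double-quote characters inside the value strings of key-value pairs.

It assumes that the text to be escaped starts after the sequence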

": "

and ends before the sequence

",\n

or

"\n

Passing the posted JSON to the function produces this return value:

{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva,   172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center \"ADVANCE\"",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}

Keep in mind that you can easily customize the function if it does not fully meet your needs.
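
One possible customization, sketched under the same one-pair-per-line assumption: a regular expression that treats each line as key, value body and closing quote, and escapes only quotes that are not already preceded by a backslash, so running it twice is harmless. The names `PAIR` and `escape_inner_quotes` are invented for this sketch.

```python
import json
import re

# One '"key": "value"' pair per line, optionally followed by a comma.
# Group 1: indentation, key and opening quote; group 2: value body;
# group 3: closing quote plus optional comma.
PAIR = re.compile(r'^([ \t]*"[^"\n]*":[ \t]*")(.*)("[ \t]*,?[ \t]*)$', re.MULTILINE)

def escape_inner_quotes(text):
    def repl(match):
        head, body, tail = match.groups()
        # Escape only quotes that are not already escaped.
        body = re.sub(r'(?<!\\)"', r'\\"', body)
        return head + body + tail
    return PAIR.sub(repl, text)

bad = '{\n    "title": "Center "ADVANCE"",\n    "text": "Business.English."\n}'
fixed = escape_inner_quotes(bad)
print(json.loads(fixed))
```

Lines holding more than one pair, or values that themselves end in `",`, would still be mishandled, so this remains a stopgap until the API is fixed.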


the*_*der 5

The idea above is good, but I ran into a problem with it: my JSON string contained only one extra double quote. So I made a fix to the code given above.

jsonStr is

{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva,   172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center "ADVANCE",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}

The fix is as follows:

import json, re
def fixJSON(jsonStr):
    # Strip every backslash from the JSON string (this also drops valid escapes).
    jsonStr = re.sub(r'\\', '', jsonStr)
    try:
        return json.loads(jsonStr)
    except ValueError:
        while True:
            # Search for a '"' that has a word character or another '"' on
            # both sides (with at most one whitespace character in between).
            b = re.search(r'[\w"]\s?(")\s?[\w"]', jsonStr)

            # If we don't find any, we leave the loop.
            if not b:
                break

            # Replace the offending '"' with "'".
            s, e = b.span(1)
            jsonStr = jsonStr[:s] + "'" + jsonStr[e:]
        return json.loads(jsonStr)

This code also works for the JSON string mentioned in the question.


Or you can also do this:

import json, re
def fixJSON(jsonStr):
    # Strip every backslash (this also drops valid escapes).
    jsonStr = re.sub(r'\\', '', jsonStr)

    # First mark the " that are where they are supposed to be with a backtick:
    # an opening quote follows { [ , or : and a closing quote precedes } ] , or :
    # (any whitespace in between is kept).
    jsonStr = re.sub(r'([{\[,:]\s*)"', r'\1`', jsonStr)
    jsonStr = re.sub(r'"(\s*[}\],:])', r'`\1', jsonStr)

    # Remove all the unwanted " and replace them with ' '
    jsonStr = re.sub(r'"', ' ', jsonStr)

    # Put back all the " where they are supposed to be.
    jsonStr = re.sub(r'`', '"', jsonStr)

    return json.loads(jsonStr)