Extract .json from a text file containing arbitrary text

jhd*_*023 3 text-processing json jq

I have the output of a program that prints some arbitrary text with .json content in the middle of it, for example:

blablablabla
blablab some more text

blablablabla
blablab some more text
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}


blablablabla
blablab some more text


blablablabla
blablab some more text

I want to strip the text outside the .json so that I can parse it with "jq".

I only need this part:

{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

Thanks!

Sté*_*las 5

sed '/^{/,/^}/!d' < input

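extracts the part of the file between a line that starts with { and the next line after it that starts with }, both lines included.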
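
As a minimal sketch of how that range command behaves, the snippet below wraps it in a shell function and feeds it a few made-up lines. The function name extract_json_block and the sample input are assumptions for illustration only, not part of the original answer.

```shell
# Hypothetical helper name; it only wraps the sed command shown above.
extract_json_block() {
    # Delete every line that is NOT inside a range that begins at a line
    # starting with "{" and ends at the next line starting with "}".
    sed '/^{/,/^}/!d'
}

# Made-up sample standing in for the program's output.
printf '%s\n' 'some text' '{' '    "title": "example glossary"' '}' 'more text' |
    extract_json_block
```

Indented closing braces of nested objects do not end the range, because only a } in the first column matches /^}/. Assuming jq is installed, the result can be piped straight into it, for example: extract_json_block < input | jq .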

pcregrep -Mo '(?s)(\{(?:[^{}"]++|"(?:\\.|[^"])*+"|(?1))*\})' < file

extracts the top-level {...} pairs wherever they are, and is smart enough to cope with input such as {"x":{"y":1}} (nested {}), { "x}" } (a } inside a string), or { "x\"}" } (escaped quotes inside a string).

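If you don't have and can't install pcregrep (it ships with the PCRE library), but you have GNU grep built with PCRE support, you can replace pcregrep -Mo with grep -zPo (-P is the option that enables PCRE), although that loads the whole file into memory. Or use perl -l -0777 -ne 'print for m{regexp-above}g'.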