jhd*_*023 3 text-processing json jq
I have the output of a program that gives some arbitrary text with .json content in it, for example:
blablablabla
blablab some more text
blablablabla
blablab some more text
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
blablablabla
blablab some more text
blablablabla
blablab some more text
I want to clean up the text outside the .json so that I can parse it with "jq".
I only need this text:
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
Thanks!
sed '/^{/,/^}/!d' < input
This extracts the part of the file between a line that starts with { and the next line after it that starts with }.
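As a quick check of the range address, here is a minimal sketch on a made-up input (the temporary file and its contents are assumptions for illustration, not the asker's real data). It relies on the inner lines being indented, so that only the outermost braces sit at the start of a line:

```shell
# Hypothetical sample input: noise around one pretty-printed JSON object.
input=$(mktemp)
cat > "$input" <<'EOF'
blablablabla
{
  "glossary": {
    "title": "example glossary"
  }
}
blablab some more text
EOF

# Delete (d) every line that is NOT (!) inside the range running from a line
# starting with "{" to the next line starting with "}".
extracted=$(sed '/^{/,/^}/!d' < "$input")
printf '%s\n' "$extracted"
rm -f "$input"
```

The result can then be piped straight into jq, for example sed '/^{/,/^}/!d' < input | jq . to pretty-print it. If the JSON is not indented, or an inner closing brace starts a line, the range ends too early.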
pcregrep -Mo '(?s)(\{(?:[^{}"]++|"(?:\\.|[^"])*+"|(?1))*\})' < file
This extracts the top-level {...} pairs wherever they are, and is smart enough to cope with input like {"x":{"y":1}} (nested {}), { "x}" } (} inside strings), or { "x\"}" } (escaped quotes in strings).
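If you don't have and can't install pcregrep (which ships with the PCRE library) but do have GNU grep built with PCRE support, you can replace pcregrep -Mo with grep -zPo (-P selects Perl-compatible regular expressions), though that loads the whole file into memory. Or use perl -l -0777 -ne 'print for m{regexp-above}g'.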