How do I parse specific ids out of a text file?

Kum*_*mar 1 shell-script text-processing json

I have a very long text file; part of its content looks like this:

[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_2","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_3","pfam":"Cu_amine_oxid","uniprot":"T12807"},{"site":"1a2v_4","pfam":"Cu_amine_oxid","uniprot":"P12808"},{"site":"1a2v_5","pfam":"Cu_amine_oxid","uniprot":"Z12809"},{"site":"1a2v_6","pfam":"Cu_amine_oxid","uniprot":"P12821"},{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"},

I need to parse the uniprot ids out of the text file above. The expected result is:

P12807
P12807
T12807
P12808
Z12809
P12821
P0C918

To do this I tried the following commands, but neither worked for me:

sed -e 's/"uniprot":"\(.*\)"},{"site":"/\1/' file.txt
cat file.txt | sed 's/.*"uniprot":" //' | sed 's/"site":".*$//'

Please help me parse the ids mentioned above.

Thanks in advance.

ter*_*don 12

If you're on a Linux system, you can do this very easily:

$ grep -oP '"uniprot":"\K[^"]+' file
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918

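The -o tells grep to print only the matching part of each line, and -P enables Perl-compatible regular expressions. The regex looks for "uniprot":" but then discards it (\K means "discard everything matched so far", so it is not included in the output). After that, it simply matches the longest run of non-" characters ([^"]+).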
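
The -P option is a GNU grep extension, so on systems without it the same "match the key, then keep only the value" idea can be approximated with plain grep -o and cut. The following is only a sketch: the temporary file and its two records are made-up sample data in the same shape as the question's file, and it assumes the ids never contain a double quote.

```shell
# Hypothetical sample data, shaped like the file in the question.
sample=$(mktemp)
printf '%s\n' '[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"},' > "$sample"

# grep -o prints every match on its own line, e.g. "uniprot":"P12807".
# Splitting that match on double quotes leaves the id in the 4th field.
ids=$(grep -o '"uniprot":"[^"]*"' "$sample" | cut -d'"' -f4)
printf '%s\n' "$ids"

rm -f "$sample"
```

Here cut does the job that \K does in the PCRE version: it drops the "uniprot":" prefix (and the closing quote) after the match has been made.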


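Of course, this looks like JSON data, so for anything more complex you should use a proper parser, such as jq. If you fix your file by adding the closing ] (in place of the trailing comma) so that it looks like this: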

[{"site":"1a2v_1","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_2","pfam":"Cu_amine_oxid","uniprot":"P12807"},{"site":"1a2v_3","pfam":"Cu_amine_oxid","uniprot":"T12807"},{"site":"1a2v_4","pfam":"Cu_amine_oxid","uniprot":"P12808"},{"site":"1a2v_5","pfam":"Cu_amine_oxid","uniprot":"Z12809"},{"site":"1a2v_6","pfam":"Cu_amine_oxid","uniprot":"P12821"},{"site":"1a3z_1","pfam":"Copper-bind,SoxE","uniprot":"P0C918"}]

you can do:

$ jq -r '.[].uniprot' file
P12807
P12807
T12807
P12808
Z12809
P12821
P0C918
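
The file fix described above (closing the array so that jq accepts it) could be scripted along these lines. This is a sketch under assumptions: the sample file is made-up data, and it assumes the last line of the real file ends with the dangling comma shown in the question.

```shell
# Hypothetical truncated input: an unterminated array ending in a comma.
broken=$(mktemp)
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a3z_1","uniprot":"P0C918"},' > "$broken"

# On the last line only ($), replace the trailing comma with the closing ].
fixed=$(sed '$s/,$/]/' "$broken")
printf '%s\n' "$fixed"

rm -f "$broken"
```

Redirecting the sed output to a new file would give something that can be passed to the jq command shown above.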

  • If any entry is missing the `uniprot` key, you will get a `null` value for it. To skip those, use `.[].uniprot // empty`. (2 upvotes)
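
To illustrate the difference the comment describes, here is a small sketch. It assumes jq is installed; the three records are made-up sample data, and the second one deliberately has no `uniprot` key.

```shell
# Hypothetical sample: the middle entry lacks the uniprot key.
sample=$(mktemp)
printf '%s\n' '[{"site":"1a2v_1","uniprot":"P12807"},{"site":"1a2v_2"},{"site":"1a3z_1","uniprot":"P0C918"}]' > "$sample"

# Plain lookup: the missing key is printed as null.
with_nulls=$(jq -r '.[].uniprot' "$sample")

# The alternative operator // keeps only values that are not null or false;
# "empty" produces no output for the entries that have none.
without_nulls=$(jq -r '.[].uniprot // empty' "$sample")

printf 'plain:\n%s\n' "$with_nulls"
printf 'with // empty:\n%s\n' "$without_nulls"

rm -f "$sample"
```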