在一个非常大的文件中搜索和替换字符串

dat*_*789 6 perl awk json data-manipulation large-files

我首选shell命令来完成任务.我有一个非常非常大的文件 - 大约2.8 GB,内容是JSON的内容.一切都在一条线上,我被告知那里至少有150万条记录.

我必须准备文件以供消费.每条记录必须独立.样品:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}
Run Code Online (Sandbox Code Playgroud)

或者,使用以下内容......

{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"christian.bale@hollywood.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}
Run Code Online (Sandbox Code Playgroud)

最终结果应该是:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}
Run Code Online (Sandbox Code Playgroud)

尝试的命令:

  • sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
  • awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat

尝试的命令对于小文件非常适用.但它不适用于我必须操作的2.8 GB文件.在没有任何理由的情况下,塞德在10分钟后中途退出,没有做任何事情.经过多个小时后,awk出现了分段错误(核心转储)的原因.我尝试了perl的搜索和替换,并收到错误消息"Out of memory".

任何帮助/想法都会很棒!

我的机器上的其他信息:

  • 超过105 GB的磁盘空间.
  • 8 GB内存
  • 4核CPU
  • 运行Ubuntu 14.04

neu*_*aus 3

关于perl:尝试将输入行分隔符设置$/},如下:

#!/usr/bin/perl
$/= "},"; 
while (<>){
   print "$_\n"; 
}'
Run Code Online (Sandbox Code Playgroud)

或者,作为一句:

$ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat 
Run Code Online (Sandbox Code Playgroud)